How to Create a PySpark DataFrame with Nested Structs or Arrays: The Ultimate Guide
Published on April 17, 2025
Diving Straight into Creating PySpark DataFrames with Nested Structs or Arrays
Want to build a PySpark DataFrame with complex, nested structures—like employee records with contact details or project lists—and harness them for big data analytics? Creating a DataFrame with nested structs or arrays is a powerful skill for data engineers crafting ETL pipelines with Apache Spark. These structures model hierarchical or one-to-many relationships, enabling rich queries on semi-structured data. This guide dives into the syntax and steps for creating a PySpark DataFrame with nested structs or arrays, with examples covering simple to complex scenarios. We’ll tackle key errors to keep your pipelines robust. Let’s build those nested DataFrames! For more on PySpark, see Introduction to PySpark.
Creating a DataFrame with Nested Structs or Arrays
The primary method for creating a PySpark DataFrame with nested structs or arrays is the createDataFrame method of the SparkSession, paired with a predefined schema using StructType and ArrayType. The SparkSession, Spark’s unified entry point, allows you to define complex schemas to represent nested data, such as structs (for nested objects) or arrays (for lists). This approach is ideal for ETL pipelines handling semi-structured data, like JSON or database records. Here’s the basic syntax:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
spark = SparkSession.builder.appName("NestedDataFrame").getOrCreate()
schema = StructType([
    StructField("id", StringType(), False),
    StructField("data", StructType([
        StructField("field1", StringType(), True)
    ])),  # Nested struct
    StructField("items", ArrayType(StringType()), True)  # Array
])
data = [(id, data, items), ...]
df = spark.createDataFrame(data, schema)
Let’s apply it to employee data with a nested contact struct (phone, email) and a projects array, a common ETL scenario:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType
# Initialize SparkSession
spark = SparkSession.builder.appName("NestedDataFrame").getOrCreate()
# Define schema with nested struct and array
schema = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("contact", StructType([
        StructField("phone", LongType(), True),
        StructField("email", StringType(), True)
    ]), True),
    StructField("projects", ArrayType(StringType()), True)
])
# Sample data
data = [
    ("E001", "Alice", (1234567890, "alice@example.com"), ["Project A", "Project B"]),
    ("E002", "Bob", (9876543210, "bob@example.com"), ["Project C"]),
    ("E003", "Cathy", (None, None), [])
]
# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show(truncate=False)
df.printSchema()
Output:
+-----------+-----+-------------------------------+----------------------+
|employee_id|name |contact                        |projects              |
+-----------+-----+-------------------------------+----------------------+
|E001       |Alice|{1234567890, alice@example.com}|[Project A, Project B]|
|E002       |Bob  |{9876543210, bob@example.com}  |[Project C]           |
|E003       |Cathy|{null, null}                   |[]                    |
+-----------+-----+-------------------------------+----------------------+
root
 |-- employee_id: string (nullable = false)
 |-- name: string (nullable = true)
 |-- contact: struct (nullable = true)
 |    |-- phone: long (nullable = true)
 |    |-- email: string (nullable = true)
 |-- projects: array (nullable = true)
 |    |-- element: string (containsNull = true)
This DataFrame supports queries on nested fields (e.g., contact.email) and arrays (e.g., exploding projects). Validate schema: assert isinstance(df.schema["contact"].dataType, StructType), "Nested struct missing". For SparkSession details, see SparkSession in PySpark.
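For example, the nested email is reachable with dot notation, and the projects array can be flattened with explode (a quick illustration using the DataFrame just created):
from pyspark.sql.functions import col, explode
# Pull a single field out of the contact struct
df.select("employee_id", col("contact.email").alias("email")).show(truncate=False)
# One row per project; explode drops rows whose array is empty (use explode_outer to keep them)
df.select("employee_id", explode("projects").alias("project")).show(truncate=False)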
Creating a DataFrame with a Simple Nested Struct
A simple nested struct, like a contact field with phone and email, models hierarchical data, ideal for ETL tasks representing structured objects, such as user profiles, as seen in ETL Pipelines. The StructType defines the nested structure:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
spark = SparkSession.builder.appName("SimpleNestedStruct").getOrCreate()
# Define schema with nested struct
schema_simple = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("contact", StructType([
        StructField("phone", LongType(), True),
        StructField("email", StringType(), True)
    ]), True)
])
# Sample data
data_simple = [
    ("E001", "Alice", (1234567890, "alice@example.com")),
    ("E002", "Bob", (9876543210, "bob@example.com")),
    ("E003", "Cathy", (None, None))
]
# Create DataFrame
df_simple = spark.createDataFrame(data_simple, schema_simple)
df_simple.show(truncate=False)
Output:
+-----------+-----+-------------------------------+
|employee_id|name |contact                        |
+-----------+-----+-------------------------------+
|E001       |Alice|{1234567890, alice@example.com}|
|E002       |Bob  |{9876543210, bob@example.com}  |
|E003       |Cathy|{null, null}                   |
+-----------+-----+-------------------------------+
This DataFrame enables queries like SELECT contact.email. Error to Watch: Mismatched struct data fails:
data_invalid = [("E001", "Alice", (1234567890,))] # Incomplete struct
try:
    df_invalid = spark.createDataFrame(data_invalid, schema_simple)
    df_invalid.show()
except Exception as e:
    print(f"Error: {e}")
Output:
Error: field contact: Length of object (1) does not match with length of fields (2)
Fix: Pad the struct to its full width: data_clean = [(r[0], r[1], (r[2][0], None)) for r in data_invalid]. Validate before loading: all(len(row[2]) == 2 for row in data_clean).
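Here is a minimal, runnable sketch of that fix; the pad_contact helper is illustrative, not a PySpark API:
def pad_contact(contact, width=2):
    # Illustrative helper: pad a short tuple with None so it matches the struct width
    contact = tuple(contact) if contact is not None else ()
    return (contact + (None,) * width)[:width]
data_clean = [(r[0], r[1], pad_contact(r[2])) for r in data_invalid]
assert all(len(row[2]) == 2 for row in data_clean), "Struct width mismatch"
df_fixed = spark.createDataFrame(data_clean, schema_simple)
df_fixed.show(truncate=False)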
Creating a DataFrame with a Simple Array
A simple array, like a list of projects, models one-to-many relationships, perfect for ETL tasks tracking dynamic data, such as employee assignments, as discussed in Explode Function Deep Dive. The ArrayType defines the list structure:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
spark = SparkSession.builder.appName("SimpleArray").getOrCreate()
# Define schema with array
schema_array = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("projects", ArrayType(StringType()), True)
])
# Sample data
data_array = [
    ("E001", "Alice", ["Project A", "Project B"]),
    ("E002", "Bob", ["Project C"]),
    ("E003", "Cathy", [])
]
# Create DataFrame
df_array = spark.createDataFrame(data_array, schema_array)
df_array.show(truncate=False)
Output:
+-----------+-----+----------------------+
|employee_id|name |projects              |
+-----------+-----+----------------------+
|E001       |Alice|[Project A, Project B]|
|E002       |Bob  |[Project C]           |
|E003       |Cathy|[]                    |
+-----------+-----+----------------------+
This supports array operations like exploding. Ensure each projects value is a Python list (use an empty list for employees with no assignments).
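As a quick illustration, the built-in array functions work directly on this column, and explode_outer keeps employees whose list is empty:
from pyspark.sql.functions import explode_outer, size, array_contains
# Count projects and test membership without flattening the array
df_array.select(
    "employee_id",
    size("projects").alias("num_projects"),
    array_contains("projects", "Project A").alias("has_project_a")
).show()
# explode_outer emits a null project for an empty list instead of dropping the row
df_array.select("employee_id", explode_outer("projects").alias("project")).show()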
Creating a DataFrame with Nested Structs
Nested structs, like a details struct containing contact and address, model deeply hierarchical data, extending simple structs for complex ETL tasks, as seen in DataFrame UDFs:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
spark = SparkSession.builder.appName("NestedStructs").getOrCreate()
# Define schema with nested structs
schema_nested = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("details", StructType([
        StructField("contact", StructType([
            StructField("phone", LongType(), True),
            StructField("email", StringType(), True)
        ]), True),
        StructField("address", StructType([
            StructField("city", StringType(), True)
        ]), True)
    ]), True)
])
# Sample data
data_nested = [
    ("E001", "Alice", ((1234567890, "alice@example.com"), ("New York",))),
    ("E002", "Bob", ((9876543210, "bob@example.com"), ("Boston",)))
]
# Create DataFrame
df_nested = spark.createDataFrame(data_nested, schema_nested)
df_nested.show(truncate=False)
Output:
+-----------+-----+---------------------------------------------+
|employee_id|name |details                                      |
+-----------+-----+---------------------------------------------+
|E001       |Alice|{{1234567890, alice@example.com}, {New York}}|
|E002       |Bob  |{{9876543210, bob@example.com}, {Boston}}    |
+-----------+-----+---------------------------------------------+
This enables queries like SELECT details.contact.email. Ensure nested data matches the schema depth.
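For instance, dot notation walks any depth of nesting (a small illustration on df_nested):
from pyspark.sql.functions import col
# Reach fields two levels down and alias them into flat column names
df_nested.select(
    "employee_id",
    col("details.contact.email").alias("email"),
    col("details.address.city").alias("city")
).show(truncate=False)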
Creating a DataFrame with Arrays of Structs
Arrays of structs, like a list of skills with years and certifications, model complex one-to-many relationships, extending simple arrays for advanced ETL analytics, as discussed in DataFrame UDFs:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
spark = SparkSession.builder.appName("ArrayOfStructs").getOrCreate()
# Define schema with array of structs
schema_complex = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("skills", ArrayType(StructType([
        StructField("year", IntegerType(), True),
        StructField("certification", StringType(), True)
    ])), True)
])
# Sample data
data_complex = [
    ("E001", "Alice", [(2023, "Python"), (2024, "Spark")]),
    ("E002", "Bob", [(2022, "Java")]),
    ("E003", "Cathy", [])
]
# Create DataFrame
df_complex = spark.createDataFrame(data_complex, schema_complex)
df_complex.show(truncate=False)
Output:
+-----------+-----+-------------------------------+
|employee_id|name |skills                         |
+-----------+-----+-------------------------------+
|E001       |Alice|[{2023, Python}, {2024, Spark}]|
|E002       |Bob  |[{2022, Java}]                 |
|E003       |Cathy|[]                             |
+-----------+-----+-------------------------------+
This supports queries like exploding skills. Error to Watch: Incomplete structs in arrays fail:
data_invalid = [("E001", "Alice", [(2023,)])]
try:
    df_invalid = spark.createDataFrame(data_invalid, schema_complex)
    df_invalid.show()
except Exception as e:
    print(f"Error: {e}")
Output:
Error: element in array field skills: Length of object (1) does not match with length of fields (2)
Fix: Pad each struct in the array: data_clean = [(r[0], r[1], [(s[0], None) for s in r[2]]) for r in data_invalid]. Validate: all(len(s) == 2 for row in data_clean for s in row[2]).
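To illustrate the query side, exploding skills produces one row per struct, whose fields then behave like ordinary columns (a sketch using df_complex from above):
from pyspark.sql.functions import explode, col
# One row per skill; Cathy has an empty array, so explode drops her row
df_complex.select("employee_id", explode("skills").alias("skill")) \
    .select("employee_id", col("skill.year").alias("year"), col("skill.certification").alias("certification")) \
    .show()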
Handling Null Values in Nested Structures
Null values in structs or arrays, like missing contacts or empty project lists, are common in semi-structured data. Declaring fields with nullable=True lets the schema absorb them, and the examples above already do this (e.g., contact as (None, None) and projects as []), as seen in Column Null Handling.
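A minimal sketch of querying around those nulls, using the first DataFrame (df) from this guide: filter on a nested field and substitute defaults with coalesce.
from pyspark.sql.functions import col, coalesce, lit, size
# Employees missing an email inside the contact struct
df.filter(col("contact.email").isNull()).select("employee_id", "name").show()
# Substitute a default email and flag empty project lists
df.select(
    "employee_id",
    coalesce(col("contact.email"), lit("unknown")).alias("email"),
    (size("projects") == 0).alias("no_projects")
).show(truncate=False)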
How to Fix Common DataFrame Creation Errors
Errors can disrupt nested DataFrame creation. Here are the key issues and their fixes; a combined pre-load validation sketch follows the list:
- Mismatched Struct Data: Incomplete structs fail. Fix: data_clean = [(r[0], r[1], (r[2][0], None)) for r in data_invalid]. Validate: all(len(row[2]) == 2 for row in data_clean).
- Invalid Array Elements: Non-list values or mismatched structs inside arrays fail. Fix: data_clean = [(r[0], r[1], [r[2]] if isinstance(r[2], str) else r[2]) for r in data]. Validate: all(isinstance(row[2], list) for row in data).
- Schema Mismatch: Types that don't match the data fail. Fix: Align the schema with the data. Validate: df.printSchema().
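Here is the combined pre-load validation sketch mentioned above; validate_rows is an illustrative helper (not part of PySpark) written for the four-column layout of the first example:
def validate_rows(rows):
    # Illustrative checks: struct width and array type, per row, before createDataFrame
    problems = []
    for i, (emp_id, name, contact, projects) in enumerate(rows):
        if contact is not None and len(contact) != 2:
            problems.append(f"row {i}: contact has {len(contact)} fields, expected 2")
        if not isinstance(projects, list):
            problems.append(f"row {i}: projects is {type(projects).__name__}, expected a list")
    return problems
issues = validate_rows(data)  # data and schema from the first example
if issues:
    raise ValueError("Bad rows: " + "; ".join(issues))
df = spark.createDataFrame(data, schema)
df.printSchema()  # confirm the structure matches the schema you defined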
For more, see Error Handling and Debugging.
Wrapping Up Your DataFrame Creation Mastery
Creating a PySpark DataFrame with nested structs or arrays is a vital skill, and Spark’s createDataFrame method makes it easy to handle simple structs, arrays, and complex nested structures. These techniques will level up your ETL pipelines. Try them in your next Spark job, and share tips or questions in the comments or on X. Keep exploring with DataFrame Operations!