How to Create a PySpark DataFrame with Nested Structs or Arrays: The Ultimate Guide
Published on April 17, 2025
Diving Straight into Creating PySpark DataFrames with Nested Structs or Arrays
Want to build a PySpark DataFrame with complex, nested structures—like employee records with contact details or project lists—and harness them for big data analytics? Creating a DataFrame with nested structs or arrays is a powerful skill for data engineers crafting ETL pipelines with Apache Spark. These structures model hierarchical or one-to-many relationships, enabling rich queries on semi-structured data. This guide dives into the syntax and steps for creating a PySpark DataFrame with nested structs or arrays, with examples covering simple to complex scenarios. We’ll tackle key errors to keep your pipelines robust. Let’s build those nested DataFrames! For more on PySpark, see Introduction to PySpark.
Creating a DataFrame with Nested Structs or Arrays
The primary method for creating a PySpark DataFrame with nested structs or arrays is the createDataFrame method of the SparkSession, paired with a predefined schema using StructType and ArrayType. The SparkSession, Spark’s unified entry point, allows you to define complex schemas to represent nested data, such as structs (for nested objects) or arrays (for lists). This approach is ideal for ETL pipelines handling semi-structured data, like JSON or database records. Here’s the basic syntax:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
spark = SparkSession.builder.appName("NestedDataFrame").getOrCreate()
schema = StructType([
    StructField("id", StringType(), False),
    StructField("data", StructType([
        StructField("field1", StringType(), True)
    ])),  # Nested struct
    StructField("items", ArrayType(StringType()), True)  # Array
])
data = [(id, data, items), ...]
df = spark.createDataFrame(data, schema)
Let’s apply it to employee data with a nested contact struct (phone, email) and a projects array, a common ETL scenario:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType
# Initialize SparkSession
spark = SparkSession.builder.appName("NestedDataFrame").getOrCreate()
# Define schema with nested struct and array
schema = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("contact", StructType([
        StructField("phone", LongType(), True),
        StructField("email", StringType(), True)
    ]), True),
    StructField("projects", ArrayType(StringType()), True)
])
# Sample data
data = [
    ("E001", "Alice", (1234567890, "alice@example.com"), ["Project A", "Project B"]),
    ("E002", "Bob", (9876543210, "bob@example.com"), ["Project C"]),
    ("E003", "Cathy", (None, None), [])
]
# Create DataFrame
df = spark.createDataFrame(data, schema)
df.show(truncate=False)
df.printSchema()
Output:
+-----------+-----+-------------------------------+----------------------+
|employee_id|name |contact                        |projects              |
+-----------+-----+-------------------------------+----------------------+
|E001       |Alice|{1234567890, alice@example.com}|[Project A, Project B]|
|E002       |Bob  |{9876543210, bob@example.com}  |[Project C]           |
|E003       |Cathy|{null, null}                   |[]                    |
+-----------+-----+-------------------------------+----------------------+
root
 |-- employee_id: string (nullable = false)
 |-- name: string (nullable = true)
 |-- contact: struct (nullable = true)
 |    |-- phone: long (nullable = true)
 |    |-- email: string (nullable = true)
 |-- projects: array (nullable = true)
 |    |-- element: string (containsNull = true)
This DataFrame supports queries on nested fields (e.g., contact.email) and arrays (e.g., exploding projects). Validate schema: assert isinstance(df.schema["contact"].dataType, StructType), "Nested struct missing". For SparkSession details, see SparkSession in PySpark.
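For example, the nested email is reachable with dot notation, and the projects array can be flattened with explode (a quick illustration using the DataFrame just created):
from pyspark.sql.functions import col, explode
# Pull a single field out of the contact struct
df.select("employee_id", col("contact.email").alias("email")).show(truncate=False)
# One row per project; explode drops rows whose array is empty (use explode_outer to keep them)
df.select("employee_id", explode("projects").alias("project")).show(truncate=False)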
Creating a DataFrame with a Simple Nested Struct
A simple nested struct, like a contact field with phone and email, models hierarchical data, ideal for ETL tasks representing structured objects, such as user profiles, as seen in ETL Pipelines. The StructType defines the nested structure:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
spark = SparkSession.builder.appName("SimpleNestedStruct").getOrCreate()
# Define schema with nested struct
schema_simple = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("contact", StructType([
        StructField("phone", LongType(), True),
        StructField("email", StringType(), True)
    ]), True)
])
# Sample data
data_simple = [
    ("E001", "Alice", (1234567890, "alice@example.com")),
    ("E002", "Bob", (9876543210, "bob@example.com")),
    ("E003", "Cathy", (None, None))
]
# Create DataFrame
df_simple = spark.createDataFrame(data_simple, schema_simple)
df_simple.show(truncate=False)
Output:
+-----------+-----+-------------------------------+
|employee_id|name |contact                        |
+-----------+-----+-------------------------------+
|E001       |Alice|{1234567890, alice@example.com}|
|E002       |Bob  |{9876543210, bob@example.com}  |
|E003       |Cathy|{null, null}                   |
+-----------+-----+-------------------------------+
This DataFrame enables queries like SELECT contact.email. Error to Watch: Mismatched struct data fails:
data_invalid = [("E001", "Alice", (1234567890,))] # Incomplete struct
try:
    df_invalid = spark.createDataFrame(data_invalid, schema_simple)
    df_invalid.show()
except Exception as e:
    print(f"Error: {e}")
Output:
Error: field contact: Length of object (1) does not match with length of fields (2)
Fix: Pad the struct to its full width: data_clean = [(r[0], r[1], (r[2][0], None)) for r in data_invalid]. Validate before loading: all(len(row[2]) == 2 for row in data_clean).
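Here is a minimal, runnable sketch of that fix; the pad_contact helper is illustrative, not a PySpark API:
def pad_contact(contact, width=2):
    # Illustrative helper: pad a short tuple with None so it matches the struct width
    contact = tuple(contact) if contact is not None else ()
    return (contact + (None,) * width)[:width]
data_clean = [(r[0], r[1], pad_contact(r[2])) for r in data_invalid]
assert all(len(row[2]) == 2 for row in data_clean), "Struct width mismatch"
df_fixed = spark.createDataFrame(data_clean, schema_simple)
df_fixed.show(truncate=False)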
Creating a DataFrame with a Simple Array
A simple array, like a list of projects, models one-to-many relationships, perfect for ETL tasks tracking dynamic data, such as employee assignments, as discussed in Explode Function Deep Dive. The ArrayType defines the list structure:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, ArrayType
spark = SparkSession.builder.appName("SimpleArray").getOrCreate()
# Define schema with array
schema_array = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("projects", ArrayType(StringType()), True)
])
# Sample data
data_array = [
    ("E001", "Alice", ["Project A", "Project B"]),
    ("E002", "Bob", ["Project C"]),
    ("E003", "Cathy", [])
]
# Create DataFrame
df_array = spark.createDataFrame(data_array, schema_array)
df_array.show(truncate=False)
Output:
+-----------+-----+----------------------+
|employee_id|name |projects              |
+-----------+-----+----------------------+
|E001       |Alice|[Project A, Project B]|
|E002       |Bob  |[Project C]           |
|E003       |Cathy|[]                    |
+-----------+-----+----------------------+
This supports array operations like exploding. Ensure each projects value is a Python list (use an empty list for employees with no assignments).
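As a quick illustration, the built-in array functions work directly on this column, and explode_outer keeps employees whose list is empty:
from pyspark.sql.functions import explode_outer, size, array_contains
# Count projects and test membership without flattening the array
df_array.select(
    "employee_id",
    size("projects").alias("num_projects"),
    array_contains("projects", "Project A").alias("has_project_a")
).show()
# explode_outer emits a null project for an empty list instead of dropping the row
df_array.select("employee_id", explode_outer("projects").alias("project")).show()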
Creating a DataFrame with Nested Structs
Nested structs, like a details struct containing contact and address, model deeply hierarchical data, extending simple structs for complex ETL tasks, as seen in DataFrame UDFs:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
spark = SparkSession.builder.appName("NestedStructs").getOrCreate()
# Define schema with nested structs
schema_nested = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("details", StructType([
        StructField("contact", StructType([
            StructField("phone", LongType(), True),
            StructField("email", StringType(), True)
        ]), True),
        StructField("address", StructType([
            StructField("city", StringType(), True)
        ]), True)
    ]), True)
])
# Sample data
data_nested = [
    ("E001", "Alice", ((1234567890, "alice@example.com"), ("New York",))),
    ("E002", "Bob", ((9876543210, "bob@example.com"), ("Boston",)))
]
# Create DataFrame
df_nested = spark.createDataFrame(data_nested, schema_nested)
df_nested.show(truncate=False)
Output:
+-----------+-----+---------------------------------------------+
|employee_id|name |details                                      |
+-----------+-----+---------------------------------------------+
|E001       |Alice|{{1234567890, alice@example.com}, {New York}}|
|E002       |Bob  |{{9876543210, bob@example.com}, {Boston}}    |
+-----------+-----+---------------------------------------------+
This enables queries like SELECT details.contact.email. Ensure nested data matches the schema depth.
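For instance, dot notation walks any depth of nesting (a small illustration on df_nested):
from pyspark.sql.functions import col
# Reach fields two levels down and alias them into flat column names
df_nested.select(
    "employee_id",
    col("details.contact.email").alias("email"),
    col("details.address.city").alias("city")
).show(truncate=False)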
Creating a DataFrame with Arrays of Structs
Arrays of structs, like a list of skills with years and certifications, model complex one-to-many relationships, extending simple arrays for advanced ETL analytics, as discussed in DataFrame UDFs:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType
spark = SparkSession.builder.appName("ArrayOfStructs").getOrCreate()
# Define schema with array of structs
schema_complex = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("skills", ArrayType(StructType([
        StructField("year", IntegerType(), True),
        StructField("certification", StringType(), True)
    ])), True)
])
# Sample data
data_complex = [
    ("E001", "Alice", [(2023, "Python"), (2024, "Spark")]),
    ("E002", "Bob", [(2022, "Java")]),
    ("E003", "Cathy", [])
]
# Create DataFrame
df_complex = spark.createDataFrame(data_complex, schema_complex)
df_complex.show(truncate=False)
Output:
+-----------+-----+-------------------------------+
|employee_id|name |skills                         |
+-----------+-----+-------------------------------+
|E001       |Alice|[{2023, Python}, {2024, Spark}]|
|E002       |Bob  |[{2022, Java}]                 |
|E003       |Cathy|[]                             |
+-----------+-----+-------------------------------+
This supports queries like exploding skills. Error to Watch: Incomplete structs in arrays fail:
data_invalid = [("E001", "Alice", [(2023,)])]
try:
    df_invalid = spark.createDataFrame(data_invalid, schema_complex)
    df_invalid.show()
except Exception as e:
    print(f"Error: {e}")
Output:
Error: element in array field skills: Length of object (1) does not match with length of fields (2)
Fix: Pad each struct in the array: data_clean = [(r[0], r[1], [(s[0], None) for s in r[2]]) for r in data_invalid]. Validate: all(len(s) == 2 for row in data_clean for s in row[2]).
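To illustrate the query side, exploding skills produces one row per struct, whose fields then behave like ordinary columns (a sketch using df_complex from above):
from pyspark.sql.functions import explode, col
# One row per skill; Cathy has an empty array, so explode drops her row
df_complex.select("employee_id", explode("skills").alias("skill")) \
    .select("employee_id", col("skill.year").alias("year"), col("skill.certification").alias("certification")) \
    .show()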
Handling Null Values in Nested Structures
Null values in structs or arrays, like missing contacts or empty project lists, are common in semi-structured data. Declaring fields with nullable=True lets the schema absorb them, and the examples above already do this (e.g., contact as (None, None) and projects as []), as seen in Column Null Handling.
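A minimal sketch of querying around those nulls, using the first DataFrame (df) from this guide: filter on a nested field and substitute defaults with coalesce.
from pyspark.sql.functions import col, coalesce, lit, size
# Employees missing an email inside the contact struct
df.filter(col("contact.email").isNull()).select("employee_id", "name").show()
# Substitute a default email and flag empty project lists
df.select(
    "employee_id",
    coalesce(col("contact.email"), lit("unknown")).alias("email"),
    (size("projects") == 0).alias("no_projects")
).show(truncate=False)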
How to Fix Common DataFrame Creation Errors
Errors can disrupt nested DataFrame creation. Here are the key issues and their fixes; a combined pre-load validation sketch follows the list:
- Mismatched Struct Data: Incomplete structs fail. Fix: data_clean = [(r[0], r[1], (r[2][0], None)) for r in data_invalid]. Validate: all(len(row[2]) == 2 for row in data_clean).
- Invalid Array Elements: Non-list values or mismatched structs inside arrays fail. Fix: data_clean = [(r[0], r[1], [r[2]] if isinstance(r[2], str) else r[2]) for r in data]. Validate: all(isinstance(row[2], list) for row in data).
- Schema Mismatch: Types that don't match the data fail. Fix: Align the schema with the data. Validate: df.printSchema().
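Here is the combined pre-load validation sketch mentioned above; validate_rows is an illustrative helper (not part of PySpark) written for the four-column layout of the first example:
def validate_rows(rows):
    # Illustrative checks: struct width and array type, per row, before createDataFrame
    problems = []
    for i, (emp_id, name, contact, projects) in enumerate(rows):
        if contact is not None and len(contact) != 2:
            problems.append(f"row {i}: contact has {len(contact)} fields, expected 2")
        if not isinstance(projects, list):
            problems.append(f"row {i}: projects is {type(projects).__name__}, expected a list")
    return problems
issues = validate_rows(data)  # data and schema from the first example
if issues:
    raise ValueError("Bad rows: " + "; ".join(issues))
df = spark.createDataFrame(data, schema)
df.printSchema()  # confirm the structure matches the schema you defined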
For more, see Error Handling and Debugging.
Wrapping Up Your DataFrame Creation Mastery
Creating a PySpark DataFrame with nested structs or arrays is a vital skill, and Spark’s createDataFrame method makes it easy to handle simple structs, arrays, and complex nested structures. These techniques will level up your ETL pipelines. Try them in your next Spark job, and share tips or questions in the comments or on X. Keep exploring with DataFrame Operations!