How to Filter Rows with Null or Non-Null Values in a PySpark DataFrame Column: The Ultimate Guide

Published on April 17, 2025


Diving Straight into Filtering Rows with Null or Non-Null Values in a PySpark DataFrame

Filtering rows in a PySpark DataFrame based on whether a column contains null or non-null values is a critical skill for data engineers using Apache Spark. This technique is essential for data quality checks, cleaning datasets, or isolating valid records in ETL pipelines, such as identifying missing data or ensuring complete entries for analysis. This comprehensive guide explores the syntax and steps for filtering rows with null or non-null values in a column, with examples covering basic null filtering, combining with other conditions, nested data, and SQL-based approaches. Each section addresses a specific aspect of null-based filtering, supported by practical code, error handling, and performance optimization strategies to build robust pipelines. The primary method, filter() with isNull() or isNotNull(), is explained with all relevant considerations. Let’s clean those datasets! For more on PySpark, see PySpark Fundamentals.


Filtering Rows with Null Values in a Column

The primary method for filtering rows in a PySpark DataFrame is the filter() method (or its alias where()), combined with the isNull() Column method to identify rows where a column’s value is null. This approach is ideal for ETL pipelines that need to detect missing data, such as incomplete employee records.

Understanding filter(), where(), isNull(), and isNotNull() Parameters

  • filter(condition) or where(condition):
    • condition (Column or str, required): A boolean expression defining the filtering criteria, such as col("column").isNull() or a SQL-like string (e.g., "column IS NULL").
    • Returns: A new DataFrame containing only the rows where the condition evaluates to True.
    • Note: filter() and where() are interchangeable; both accept Column expressions or SQL-like condition strings, with where() simply reading more like SQL (see the string-condition sketch after the first example below).
  • isNull() (Column method):
    • Parameters: None.
    • Returns: A Column expression evaluating to True if the column’s value is null, False otherwise.
  • isNotNull() (Column method):
    • Parameters: None.
    • Returns: A Column expression evaluating to True if the column’s value is not null, False otherwise.

Here’s an example filtering employees with null values in the name column:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize SparkSession
spark = SparkSession.builder.appName("NullFilter").getOrCreate()

# Create DataFrame with nulls
data = [
    ("E001", "Alice", 25, 75000.0, "HR"),
    ("E002", None, 30, 82000.5, "IT"),
    ("E003", "Cathy", 28, 90000.75, "HR"),
    ("E004", None, 35, 100000.25, "IT"),
    ("E005", "Eve", 28, 78000.0, "Finance")
]
df = spark.createDataFrame(data, ["employee_id", "name", "age", "salary", "department"])

# Filter rows where name is null
filtered_df = df.filter(col("name").isNull())
filtered_df.show(truncate=False)

Output:

+-----------+----+---+---------+----------+
|employee_id|name|age|salary   |department|
+-----------+----+---+---------+----------+
|E002       |null|30 |82000.5  |IT        |
|E004       |null|35 |100000.25|IT        |
+-----------+----+---+---------+----------+

This filters rows where name is null, returning two rows (E002, E004). The isNull() method flags null values in the column. Validate:

assert filtered_df.count() == 2, "Incorrect row count"
assert all(row["name"] is None for row in filtered_df.collect()), "Non-null name included"

Error to Watch: Filtering on a non-existent column fails:

try:
    filtered_df = df.filter(col("invalid_column").isNull())
    filtered_df.show()
except Exception as e:
    print(f"Error: {e}")

Output:

Error: Column 'invalid_column' does not exist

Fix: Verify column:

assert "name" in df.columns, "Column missing"

Filtering Rows with Non-Null Values in a Column

To filter rows where a column has non-null values, use isNotNull() within filter(). This is useful for ensuring data completeness, such as selecting records with valid entries for analysis.

from pyspark.sql.functions import col

# Filter rows where name is not null
filtered_df = df.filter(col("name").isNotNull())
filtered_df.show(truncate=False)

Output:

+-----------+-----+---+---------+----------+
|employee_id|name |age|salary   |department|
+-----------+-----+---+---------+----------+
|E001       |Alice|25 |75000.0  |HR        |
|E003       |Cathy|28 |90000.75 |HR        |
|E005       |Eve  |28 |78000.0  |Finance   |
+-----------+-----+---+---------+----------+

This filters rows where name is not null, returning three rows (E001, E003, E005). Validate:

assert filtered_df.count() == 3, "Incorrect row count"
assert all(row["name"] is not None for row in filtered_df.collect()), "Null name included"

Pitfall to Watch: a non-null check on one column says nothing about the others, so rows with nulls elsewhere still pass the filter:

# Filtering only on name: rows may still contain nulls in other columns
filtered_df = df.filter(col("name").isNotNull())

Fix: Explicitly check multiple columns if needed:

filtered_df = df.filter(col("name").isNotNull() & col("salary").isNotNull())
assert filtered_df.count() == 3, "Ensure all required columns are non-null"
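
If the goal is simply to drop every row with a null in a required set of columns, DataFrame.dropna() is a more compact alternative; a minimal sketch using the same df:

# Drop rows that have a null in either name or salary
complete_df = df.dropna(subset=["name", "salary"])
complete_df.show(truncate=False)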

Combining Null/Non-Null Filtering with Other Conditions

To filter rows based on null or non-null values alongside other conditions, combine isNull() or isNotNull() with additional criteria using logical operators (&, |, ~). This is useful for complex ETL filtering, such as selecting non-null records meeting specific thresholds.

from pyspark.sql.functions import col

# Filter rows where name is not null and salary > 80000
filtered_df = df.filter((col("name").isNotNull()) & (col("salary") > 80000))
filtered_df.show(truncate=False)

Output:

+-----------+-----+---+---------+----------+
|employee_id|name |age|salary   |department|
+-----------+-----+---+---------+----------+
|E003       |Cathy|28 |90000.75 |HR        |
+-----------+-----+---+---------+----------+

This filters rows where name is not null and salary exceeds 80,000, returning one row (E003). Validate:

assert filtered_df.count() == 1, "Incorrect row count"
assert filtered_df.filter(col("name").isNull()).count() == 0, "Null name included"

Error to Watch: Incorrect operator precedence fails. Python’s & binds more tightly than comparison operators, so without parentheses the condition parses as (col("name").isNotNull() & col("salary")) > 80000, which Spark cannot resolve:

try:
    filtered_df = df.filter(col("name").isNotNull() & col("salary") > 80000)  # Missing parentheses
    filtered_df.show()
except Exception as e:
    print(f"Error: {e}")

Output (abbreviated; exact wording varies by Spark version):

Error: ... due to data type mismatch ...

Fix: Use parentheses:

assert isinstance((col("name").isNotNull()) & (col("salary") > 80000), Column), "Invalid condition syntax"

Filtering Nested Data with Null/Non-Null Values

Nested DataFrames, with structs or arrays, are common in complex datasets like employee contact details. Filtering rows based on null or non-null values in a nested field, such as contact.email, requires dot notation with isNull() or isNotNull(). This is crucial for hierarchical data in ETL pipelines.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("NestedNullFilter").getOrCreate()

# Define schema with nested structs
schema = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("contact", StructType([
        StructField("phone", LongType(), True),
        StructField("email", StringType(), True)
    ]), True),
    StructField("department", StringType(), True)
])

# Create DataFrame
data = [
    ("E001", "Alice", (1234567890, "alice@company.com"), "HR"),
    ("E002", "Bob", (None, None), "IT"),
    ("E003", "Cathy", (5555555555, "cathy@company.com"), "HR"),
    ("E004", "David", (9876543210, None), "IT")
]
df = spark.createDataFrame(data, schema)

# Filter rows where contact.email is not null
filtered_df = df.filter(col("contact.email").isNotNull())
filtered_df.show(truncate=False)

Output:

+-----------+-----+--------------------------------+----------+
|employee_id|name |contact                         |department|
+-----------+-----+--------------------------------+----------+
|E001       |Alice|[1234567890, alice@company.com] |HR        |
|E003       |Cathy|[5555555555, cathy@company.com] |HR        |
+-----------+-----+--------------------------------+----------+

This filters rows where contact.email is not null, returning two rows (E001, E003). Validate:

assert filtered_df.count() == 2, "Incorrect row count"
assert "alice@company.com" in [row["contact"]["email"] for row in filtered_df.collect()], "Expected email missing"

Error to Watch: Invalid nested field fails:

try:
    filtered_df = df.filter(col("contact.invalid_field").isNotNull())
    filtered_df.show()
except Exception as e:
    print(f"Error: {e}")

Output:

Error: StructField 'contact' does not contain field 'invalid_field'

Fix: Validate nested field:

assert "email" in [f.name for f in df.schema["contact"].dataType.fields], "Nested field missing"

Filtering Using SQL Queries

For SQL-based ETL workflows or teams familiar with database querying, SQL queries via temporary views offer an intuitive way to filter rows with null or non-null values. The IS NULL and IS NOT NULL operators provide null-based filtering in a SQL-like syntax.

# Register the flat employees DataFrame (from the first example) as a temporary view
df.createOrReplaceTempView("employees")

# Filter rows where name is null using SQL
filtered_df = spark.sql("SELECT * FROM employees WHERE name IS NULL")
filtered_df.show(truncate=False)

Output:

+-----------+----+---+---------+----------+
|employee_id|name|age|salary   |department|
+-----------+----+---+---------+----------+
|E002       |null|30 |82000.5  |IT        |
|E004       |null|35 |100000.25|IT        |
+-----------+----+---+---------+----------+

This filters rows where name is null using SQL. For non-null filtering:

filtered_df = spark.sql("SELECT * FROM employees WHERE name IS NOT NULL")
filtered_df.show(truncate=False)

Output:

+-----------+-----+---+---------+----------+
|employee_id|name |age|salary   |department|
+-----------+-----+---+---------+----------+
|E001       |Alice|25 |75000.0  |HR        |
|E003       |Cathy|28 |90000.75 |HR        |
|E005       |Eve  |28 |78000.0  |Finance   |
+-----------+-----+---+---------+----------+

Validate:

assert filtered_df.count() == 3, "Incorrect row count"
assert filtered_df.filter(col("name").isNull()).count() == 0, "Null name included"

Error to Watch: Unregistered view fails:

try:
    filtered_df = spark.sql("SELECT * FROM nonexistent WHERE name IS NULL")
    filtered_df.show()
except Exception as e:
    print(f"Error: {e}")

Output:

Error: Table or view not found: nonexistent

Fix: Verify view:

assert "employees" in [v.name for v in spark.catalog.listTables()], "View missing"
df.createOrReplaceTempView("employees")

Optimizing Performance for Null/Non-Null Filtering

Filtering rows based on null or non-null values involves scanning the DataFrame, which can be intensive for large datasets. Optimize performance to ensure efficient data extraction:

  1. Select Relevant Columns: Reduce the data scanned:
df = df.select("employee_id", "name", "salary")
  2. Push Down Filters: Apply filters as early as possible:
df = df.filter(col("name").isNotNull())
  3. Partition Data: Use repartition (or partitionBy when writing):
df = df.repartition("department")
  4. Cache Intermediate Results: Cache the filtered DataFrame if it is reused:
filtered_df.cache()

Example optimized filter:

optimized_df = df.select("employee_id", "name", "salary") \
                .filter((col("name").isNotNull()) & (col("salary") > 80000)) \
                .repartition("department")
optimized_df.show(truncate=False)

Monitor performance via the Spark UI, focusing on scan and filter metrics.
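
Beyond the Spark UI, explain() prints the physical plan, which shows where the null filter sits relative to the scan; a quick check on the optimized DataFrame above:

optimized_df.explain()  # Look for the isnotnull predicate in the Filter node (and in PushedFilters when reading file sources)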

Error to Watch: Filtering after an expensive operation, such as a join, forces Spark to shuffle and process far more data than necessary:

# Example with a large DataFrame: the null filter runs only after the full join
ids_df = spark.range(10000000).selectExpr("concat('E', cast(id AS string)) AS employee_id")
large_df = ids_df.join(df, "employee_id", "left")
filtered_df = large_df.filter(col("name").isNotNull())  # Inefficient: the join processes every row first

Fix: Apply the null filter before the join so less data is shuffled:

non_null_df = df.filter(col("name").isNotNull())  # Filter the smaller DataFrame first
large_df = ids_df.join(non_null_df, "employee_id", "left")

Wrapping Up Your Null/Non-Null Filtering Mastery

Filtering rows with null or non-null values in a PySpark DataFrame column is a critical skill for ensuring data quality in ETL pipelines. Whether you’re using filter() with isNull() or isNotNull() for basic null checks, combining with other conditions, handling nested data with dot notation, or leveraging SQL queries with IS NULL/IS NOT NULL, Spark provides powerful tools to address diverse data processing needs. By mastering these techniques, optimizing performance, and anticipating errors, you can efficiently refine datasets, enabling accurate analyses and robust applications. These methods will enhance your data engineering workflows, empowering you to manage null-based filtering with confidence.

Try these approaches in your next Spark job, and share your experiences, tips, or questions in the comments or on X. Keep exploring with DataFrame Operations to deepen your PySpark expertise!


More Spark Resources to Keep You Going