How to Filter Rows Where a Column Contains a Substring in a PySpark DataFrame: The Ultimate Guide

Published on April 17, 2025


Diving Straight into Filtering Rows by Substring in a PySpark DataFrame

Filtering rows in a PySpark DataFrame where a column contains a specific substring is a key technique for data engineers using Apache Spark. Whether you're searching for names containing a certain pattern, identifying records with specific keywords, or refining datasets for analysis, this operation enables targeted data extraction in ETL pipelines. This comprehensive guide explores the syntax and steps for filtering rows based on substring matches, with examples covering basic substring filtering, case-insensitive searches, nested data, and SQL-based approaches. Each section addresses a specific aspect of substring filtering, supported by practical code, error handling, and performance optimization strategies to build robust pipelines. The primary method, filter() with contains(), is explained with all relevant considerations. Let’s find those substrings! For more on PySpark, see PySpark Fundamentals.


Filtering Rows with a Substring Match

The primary method for filtering rows in a PySpark DataFrame is the filter() method (or its alias where()), combined with the contains() function to check if a column’s string values include a specific substring. This approach is ideal for ETL pipelines needing to select records based on partial string matches, such as names or categories.

Understanding filter(), where(), and contains() Parameters

  • filter(condition) or where(condition):
    • condition (Column or str, required): A boolean expression defining the filtering criteria, such as col("column").contains("substring") or a SQL-like string (e.g., "column LIKE '%substring%'").
    • Returns: A new DataFrame containing only the rows where the condition evaluates to True.
    • Note: filter() and where() are interchangeable, with where() offering a SQL-like syntax for readability.
  • contains(literal) (Column method):
    • literal (str, required): The substring to search for within the column’s string values.
    • Returns: A Column expression evaluating to True if the substring is found, False otherwise.
    • Note: contains() is case-sensitive and evaluates to null for null values, so filter() drops those rows.

Here’s an example filtering employees whose names contain the substring "li":

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize SparkSession
spark = SparkSession.builder.appName("SubstringFilter").getOrCreate()

# Create DataFrame
data = [
    ("E001", "Alice", 25, 75000.0, "HR"),
    ("E002", "Bob", 30, 82000.5, "IT"),
    ("E003", "Cathy", 28, 90000.75, "HR"),
    ("E004", "David", 35, 100000.25, "IT"),
    ("E005", "Eve", 28, 78000.0, "Finance")
]
df = spark.createDataFrame(data, ["employee_id", "name", "age", "salary", "department"])

# Filter rows where name contains "li"
filtered_df = df.filter(col("name").contains("li"))
filtered_df.show(truncate=False)

Output:

+-----------+-----+---+-------+----------+
|employee_id|name |age|salary |department|
+-----------+-----+---+-------+----------+
|E001       |Alice|25 |75000.0|HR        |
+-----------+-----+---+-------+----------+

This filters rows where the name column contains "li", returning one row (E001 for "Alice"). The contains() function performs a case-sensitive substring search. Validate:

assert filtered_df.count() == 1, "Incorrect row count"
assert "Alice" in [row["name"] for row in filtered_df.select("name").collect()], "Expected name missing"

Error to Watch: Applying contains() to a non-string column can behave unexpectedly. Depending on the Spark version and ANSI settings, the column may be implicitly cast to a string, or the query may fail with an AnalysisException reporting a data type mismatch:

try:
    filtered_df = df.filter(col("age").contains("25"))  # Non-string column
    filtered_df.show()
except Exception as e:
    print(f"Error: {e}")

Fix: Verify the column is a string before relying on substring semantics:

assert df.schema["name"].dataType.typeName() == "string", "Column must be string type"
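
If you genuinely need a substring match against a numeric column, an explicit cast makes the intent unambiguous (a minimal sketch):

# Cast the numeric column to string before the substring check
filtered_df = df.filter(col("age").cast("string").contains("25"))
filtered_df.show(truncate=False)  # Returns E001 (age 25)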

Filtering with Case-Insensitive Substring Matches

For case-insensitive substring searches, use lower() to convert the column values to lowercase before applying contains(), or leverage like() with SQL-like patterns for flexibility. This is useful when substring matches should ignore case, such as searching for names regardless of capitalization.

from pyspark.sql.functions import col, lower

# Case-insensitive filter where name contains "LI"
filtered_df = df.filter(lower(col("name")).contains("li"))
filtered_df.show(truncate=False)

Output:

+-----------+-----+---+-------+----------+
|employee_id|name |age|salary |department|
+-----------+-----+---+-------+----------+
|E001       |Alice|25 |75000.0|HR        |
+-----------+-----+---+-------+----------+

This converts name to lowercase and filters for "li", returning the same row (E001). Alternatively, use rlike() with a case-insensitive regular expression:

filtered_df = df.filter(col("name").rlike("(?i)li"))
filtered_df.show(truncate=False)

Output: Same as above.

The rlike() method matches Java regular expressions, and the (?i) flag makes the match case-insensitive. Note that like() only understands the SQL wildcards % (any sequence of characters) and _ (a single character), not character classes such as [Ll], so it cannot express a case-insensitive match on its own. Validate:

assert filtered_df.count() == 1, "Incorrect row count"
assert "Alice" in [row["name"] for row in filtered_df.select("name").collect()], "Expected name missing"

Error to Watch: An invalid regular expression in rlike() fails when the query executes, with an error reporting the broken pattern (an unclosed character class in this case):

try:
    filtered_df = df.filter(col("name").rlike("(?i)[li"))  # Unclosed character class
    filtered_df.show()
except Exception as e:
    print(f"Error: {e}")

Fix: Validate the pattern before using it (Python's re syntax is close enough to Java's for a basic sanity check):

import re

pattern = "(?i)li"
try:
    re.compile(pattern)
except re.error:
    raise ValueError(f"Invalid regex pattern: {pattern}")

Filtering Nested Data with Substring Matches

Nested DataFrames, with structs or arrays, are common in complex datasets like employee contact details. Filtering rows where a nested field contains a substring requires dot notation (e.g., contact.email) within filter(). This is essential for hierarchical data in ETL pipelines.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("NestedSubstringFilter").getOrCreate()

# Define schema with nested structs
schema = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("contact", StructType([
        StructField("phone", LongType(), True),
        StructField("email", StringType(), True)
    ]), True),
    StructField("department", StringType(), True)
])

# Create DataFrame
data = [
    ("E001", "Alice", (1234567890, "alice@company.com"), "HR"),
    ("E002", "Bob", (None, "bob@company.com"), "IT"),
    ("E003", "Cathy", (5555555555, "cathy@company.com"), "HR"),
    ("E004", "David", (9876543210, "david@company.com"), "IT")
]
df = spark.createDataFrame(data, schema)

# Filter rows where contact.email contains "company"
filtered_df = df.filter(col("contact.email").contains("company"))
filtered_df.show(truncate=False)

Output:

+-----------+-----+--------------------------------+----------+
|employee_id|name |contact                         |department|
+-----------+-----+--------------------------------+----------+
|E001       |Alice|[1234567890, alice@company.com] |HR        |
|E002       |Bob  |[null, bob@company.com]         |IT        |
|E003       |Cathy|[5555555555, cathy@company.com] |HR        |
|E004       |David|[9876543210, david@company.com] |IT        |
+-----------+-----+--------------------------------+----------+

This filters rows where contact.email contains "company", returning all four rows. Validate:

assert filtered_df.count() == 4, "Incorrect row count"
assert "alice@company.com" in [row["contact"]["email"] for row in filtered_df.collect()], "Expected email missing"

Error to Watch: Null nested fields are silently excluded. contains() evaluates to null for null values, so filter() drops those rows without warning:

# Example with null email
data_with_null = data + [("E005", "Eve", (None, None), "Finance")]
df_null = spark.createDataFrame(data_with_null, schema)
filtered_df = df_null.filter(col("contact.email").contains("company"))
# Excludes E005 because its email is null

Fix: Make the null handling explicit so the intent is documented in the condition (the result is the same four rows):

filtered_df = df_null.filter(col("contact.email").isNotNull() & col("contact.email").contains("company"))
assert filtered_df.count() == 4, "Nulls not handled correctly"
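
This section mentions arrays as well as structs. Here is a minimal sketch of substring filtering over an array column, assuming a hypothetical skills array of strings and Spark 3.1+ for the exists() function:

from pyspark.sql.functions import col, exists

# Hypothetical DataFrame with an array<string> column
skills_df = spark.createDataFrame(
    [("E001", ["python", "sql"]), ("E002", ["java", "scala"])],
    ["employee_id", "skills"]
)

# Keep rows where any array element contains "sql"
filtered_skills = skills_df.filter(exists(col("skills"), lambda s: s.contains("sql")))
filtered_skills.show(truncate=False)  # Returns E001 only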

Filtering Using SQL Queries

For SQL-based ETL workflows or teams familiar with database querying, SQL queries via temporary views offer an intuitive way to filter rows by substring. The LIKE operator with wildcards (%) provides substring matching, and Spark 3.3+ also offers a contains() SQL function.

# Create temporary view
df.createOrReplaceTempView("employees")

# Filter rows where name contains "li" using SQL
filtered_df = spark.sql("SELECT * FROM employees WHERE name LIKE '%li%'")
filtered_df.show(truncate=False)

Output:

+-----------+-----+---+-------+----------+
|employee_id|name |age|salary |department|
+-----------+-----+---+-------+----------+
|E001       |Alice|25 |75000.0|HR        |
+-----------+-----+---+-------+----------+

This filters rows where name contains "li" using SQL’s LIKE operator. For case-insensitive matching, use LOWER():

filtered_df = spark.sql("SELECT * FROM employees WHERE LOWER(name) LIKE '%li%'")
filtered_df.show(truncate=False)

Output: Same as above.

Validate:

assert filtered_df.count() == 1, "Incorrect row count"
assert "Alice" in [row["name"] for row in filtered_df.select("name").collect()], "Expected name missing"

Error to Watch: Unregistered view fails:

try:
    filtered_df = spark.sql("SELECT * FROM nonexistent WHERE name LIKE '%li%'")
    filtered_df.show()
except Exception as e:
    print(f"Error: {e}")

Output:

Error: Table or view not found: nonexistent

Fix: Register the view before querying, then verify it exists:

df.createOrReplaceTempView("employees")
assert "employees" in [t.name for t in spark.catalog.listTables()], "View missing"

Optimizing Performance for Substring Filtering

Filtering rows with substring matches involves scanning the DataFrame and evaluating string operations, which can be computationally expensive for large datasets. Optimize performance to ensure efficient data extraction:

  1. Select Relevant Columns: Reduce data scanned:
df = df.select("employee_id", "name", "department")
  2. Push Down Filters: Apply filters early to minimize data:
df = df.filter(col("name").contains("li"))
  3. Partition Data: Use partitionBy or repartition for large datasets:
df = df.repartition("department")
  4. Cache Intermediate Results: Cache filtered DataFrame if reused:
filtered_df.cache()

Example optimized filter:

optimized_df = df.select("employee_id", "name", "department") \
                .filter(col("name").contains("li")) \
                .repartition("department")
optimized_df.show(truncate=False)

Monitor performance via the Spark UI, focusing on scan and filter metrics.
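
To confirm the filter actually runs close to the scan, you can also inspect the physical plan locally (a minimal sketch):

# Look for the Filter operator next to the scan in the printed plan
optimized_df.explain()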

Error to Watch: Large datasets with string operations slow performance, especially when the substring filter runs after an expensive join:

# Example with a large DataFrame (synthetic employee_id values as the join key)
large_df = spark.range(10000000).selectExpr("concat('E', cast(id AS string)) AS employee_id")
joined_df = large_df.join(df, "employee_id", "left")
filtered_df = joined_df.filter(col("name").contains("li"))  # Inefficient: filter runs after the join

Fix: Filter the small DataFrame before joining so fewer rows are processed:

filtered_df = large_df.join(df.filter(col("name").contains("li")), "employee_id", "inner")

Wrapping Up Your Substring Filtering Mastery

Filtering rows where a column contains a substring in a PySpark DataFrame is a vital skill for targeted data extraction in ETL pipelines. Whether you’re using filter() with contains() for basic substring matches, lower() for case-insensitive searches, dot notation for nested data, or SQL queries with LIKE for intuitive filtering, Spark provides powerful tools to address diverse data processing needs. By mastering these techniques, optimizing performance, and anticipating errors, you can efficiently refine datasets, enabling accurate analyses and robust applications. These methods will enhance your data engineering workflows, empowering you to manage substring-based filtering with confidence.

Try these approaches in your next Spark job, and share your experiences, tips, or questions in the comments or on X. Keep exploring with DataFrame Operations to deepen your PySpark expertise!


More Spark Resources to Keep You Going