How to Display the First n Rows of a PySpark DataFrame: The Ultimate Guide
Published on April 17, 2025
Diving Straight into Displaying the First n Rows of a PySpark DataFrame
Need to peek at the first few rows of a PySpark DataFrame—like customer orders or log entries—to inspect your data or debug an ETL pipeline? Displaying the first n rows of a DataFrame is a fundamental skill for data engineers working with Apache Spark. It’s a quick way to understand your data’s structure and content without processing the entire dataset. This guide dives into the syntax and steps for displaying the first n rows of a PySpark DataFrame, with examples covering essential scenarios. We’ll tackle key errors to keep your pipelines robust. Let’s explore those rows! For more on PySpark, see Introduction to PySpark.
Displaying the First n Rows of a DataFrame
The primary method for displaying the first n rows of a PySpark DataFrame is the show(n) method, which prints the top n rows to the console. Alternatively, the limit(n) method combined with show() retrieves the first n rows as a new DataFrame. The SparkSession, Spark’s unified entry point, enables these operations on distributed datasets. This approach is ideal for ETL pipelines needing quick data inspection or debugging. Here’s the basic syntax for show(n):
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DisplayFirstNRows").getOrCreate()
df = spark.createDataFrame(data, schema)
df.show(n)
Let’s apply it to an employee DataFrame with IDs, names, ages, and salaries, displaying the first 2 rows:
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("DisplayFirstNRows").getOrCreate()
# Create DataFrame
data = [
    ("E001", "Alice", 25, 75000.0),
    ("E002", "Bob", 30, 82000.5),
    ("E003", "Cathy", 28, 90000.75),
    ("E004", "David", 35, 100000.25)
]
df = spark.createDataFrame(data, ["employee_id", "name", "age", "salary"])
# Display first 2 rows
df.show(2, truncate=False)
Output:
+-----------+-----+---+-------+
|employee_id|name |age|salary |
+-----------+-----+---+-------+
|E001       |Alice|25 |75000.0|
|E002       |Bob  |30 |82000.5|
+-----------+-----+---+-------+
only showing top 2 rows
This displays the first 2 rows with full column values (truncate=False prevents truncation). Validate row count: assert df.count() >= 2, "Insufficient rows". For SparkSession details, see SparkSession in PySpark.
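Beyond n, show() accepts two more display arguments worth knowing: passing an integer to truncate caps each cell at that many characters, and vertical=True prints one field per line, which helps with wide schemas. A quick sketch on the same DataFrame:
# Cap each cell at 10 characters; longer values are cut and end with "..."
df.show(2, truncate=10)
# Print rows vertically (one "field: value" line per column), handy for wide rows
df.show(2, vertical=True)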
Displaying the First n Rows with Simple Data
Displaying the first n rows of a DataFrame with flat columns, like strings or numbers, is the go-to move for quick data inspection in ETL tasks, such as verifying loaded data, as seen in ETL Pipelines. The show(n) method is straightforward:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SimpleDataDisplay").getOrCreate()
# Create DataFrame
data = [
    ("E001", "Alice", 25, 75000.0),
    ("E002", "Bob", 30, 82000.5),
    ("E003", "Cathy", 28, 90000.75)
]
df = spark.createDataFrame(data, ["employee_id", "name", "age", "salary"])
# Display first 2 rows
df.show(2, truncate=False)
Output:
+-----------+-----+---+-------+
|employee_id|name |age|salary |
+-----------+-----+---+-------+
|E001       |Alice|25 |75000.0|
|E002       |Bob  |30 |82000.5|
+-----------+-----+---+-------+
only showing top 2 rows
This provides a quick snapshot of the data. Error to Watch: Invalid n value fails:
try:
    df.show(-1)
except Exception as e:
    print(f"Error: {e}")
Output:
Error: n must be a non-negative integer
Fix: Ensure n is non-negative: assert n >= 0, "n must be non-negative". Check available rows: df.count().
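If you prefer that guard baked into a reusable function, here is a minimal sketch; safe_show is a hypothetical helper name, not a PySpark API:
def safe_show(df, n, truncate=False):
    # Hypothetical helper: validate n before delegating to show()
    assert isinstance(n, int) and n >= 0, "n must be a non-negative integer"
    df.show(n, truncate=truncate)

safe_show(df, 2)  # behaves like df.show(2, truncate=False)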
Displaying the First n Rows with Nested Data
DataFrames with nested structs or arrays model complex relationships, like employee contact details or project lists, and show(n) handles them just as easily, which helps when inspecting advanced ETL data, as discussed in DataFrame UDFs:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType
spark = SparkSession.builder.appName("NestedDisplay").getOrCreate()
# Define schema with nested structs and arrays
schema = StructType([
    StructField("employee_id", StringType(), False),
    StructField("name", StringType(), True),
    StructField("contact", StructType([
        StructField("phone", LongType(), True),
        StructField("email", StringType(), True)
    ]), True),
    StructField("projects", ArrayType(StringType()), True)
])
# Create DataFrame
data = [
    ("E001", "Alice", (1234567890, "alice@example.com"), ["Project A", "Project B"]),
    ("E002", "Bob", (9876543210, "bob@example.com"), ["Project C"]),
    ("E003", "Cathy", (None, None), [])
]
df = spark.createDataFrame(data, schema)
# Display first 2 rows
df.show(2, truncate=False)
Output:
+-----------+-----+-------------------------------+----------------------+
|employee_id|name |contact                        |projects              |
+-----------+-----+-------------------------------+----------------------+
|E001       |Alice|{1234567890, alice@example.com}|[Project A, Project B]|
|E002       |Bob  |{9876543210, bob@example.com}  |[Project C]           |
+-----------+-----+-------------------------------+----------------------+
only showing top 2 rows
This displays nested structs and arrays clearly, aiding inspection of complex data. Validate: assert isinstance(df.schema["contact"].dataType, StructType), "Nested schema missing".
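To zoom in on individual nested fields instead of whole structs, you can pair show(n) with printSchema() and dot-notation column selection, both standard DataFrame methods:
# Review the nested schema first
df.printSchema()
# Select nested struct fields by dot notation, then display the first 2 rows
df.select("name", "contact.email").show(2, truncate=False)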
Displaying the First n Rows Using limit()
The limit(n) method retrieves the first n rows as a new DataFrame, offering the flexibility to process or collect the results rather than just print them, as seen in DataFrame Operations:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LimitDisplay").getOrCreate()
# Create DataFrame
data = [
    ("E001", "Alice", 25, 75000.0),
    ("E002", "Bob", 30, 82000.5),
    ("E003", "Cathy", 28, 90000.75)
]
df = spark.createDataFrame(data, ["employee_id", "name", "age", "salary"])
# Display first 2 rows using limit
df_limit = df.limit(2)
df_limit.show(truncate=False)
Output:
+-----------+-----+---+-------+
|employee_id|name |age|salary |
+-----------+-----+---+-------+
|E001       |Alice|25 |75000.0|
|E002       |Bob  |30 |82000.5|
+-----------+-----+---+-------+
This creates a new DataFrame with the first 2 rows, useful for further processing, as sketched after the fix below. Error to Watch: Negative limit fails:
try:
    df_invalid = df.limit(-1)
    df_invalid.show()
except Exception as e:
    print(f"Error: {e}")
Output:
Error: Limit must be non-negative
Fix: Ensure n is non-negative: assert n >= 0, "Limit must be non-negative".
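Because limit(n) returns a DataFrame, the slice can feed further work instead of just printing, for example collecting rows to the driver or converting to pandas (toPandas() assumes pandas is installed):
# Bring the first 2 rows to the driver as a list of Row objects
first_rows = df.limit(2).collect()
print(first_rows[0]["name"])  # Alice

# Or convert the slice to a pandas DataFrame for local analysis
pdf = df.limit(2).toPandas()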
How to Fix Common Display Errors
Errors can disrupt displaying DataFrame rows. Here are key issues, with fixes; a sketch combining the checks follows the list:
- Invalid n Value: Negative n fails. Fix: assert n >= 0, "n must be non-negative". Validate row count: df.count().
- Empty DataFrame: Displaying rows from an empty DataFrame shows nothing. Fix: Check: assert df.count() > 0, "DataFrame empty".
- Large n Value: Requesting more rows than available is safe; Spark simply shows every available row. Fix: Compare n with df.count() if you need exactly n rows.
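Here is a minimal sketch rolling these checks into one guard; inspect_rows is a hypothetical helper, and remember that count() triggers a full job, so use it sparingly on large datasets:
def inspect_rows(df, n):
    # Hypothetical guard combining the checks above
    assert isinstance(n, int) and n >= 0, "n must be non-negative"
    available = df.count()  # full scan; avoid on very large DataFrames
    if available == 0:
        print("DataFrame is empty")
        return
    if n > available:
        print(f"Only {available} rows available; showing all of them")
    df.show(n, truncate=False)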
For more, see Error Handling and Debugging.
Wrapping Up Your DataFrame Display Mastery
Displaying the first n rows of a PySpark DataFrame is a vital skill, and Spark’s show(n) and limit(n) methods make it easy to handle simple and nested data. These techniques will level up your ETL pipelines. Try them in your next Spark job, and share tips or questions in the comments or on X. Keep exploring with DataFrame Operations!