Summary Operation in PySpark DataFrames: A Comprehensive Guide
PySpark’s DataFrame API is a versatile powerhouse for big data analysis, and the summary operation stands out as a flexible tool to generate a customizable set of summary statistics for your DataFrame’s numerical columns. It’s like dialing in a tailored report—you can pick exactly which stats you want, from counts and means to percentiles, giving you a precise view of your data’s shape and distribution. Whether you’re diving into exploratory analysis, fine-tuning data quality checks, or prepping for advanced modeling, summary delivers a detailed snapshot that adapts to your needs. Built into the Spark SQL engine and powered by the Catalyst optimizer, it computes these stats efficiently across your distributed dataset, returning a new DataFrame with the results. In this guide, we’ll dive into what summary does, explore how you can use it with plenty of detail, and highlight where it fits into real-world scenarios, all with examples that bring it to life.
Ready to customize your data insights with summary? Check out PySpark Fundamentals and let’s get rolling!
What is the Summary Operation in PySpark?
The summary operation in PySpark is a method you call on a DataFrame to compute a customizable set of summary statistics for its numerical columns, returning a new DataFrame with metrics like count, mean, standard deviation, min, max, and percentiles that you can specify—or a default set if you don’t. Think of it as a data spotlight—you shine it on your numbers and choose what to see, from basic counts to detailed quartiles, all in one concise table. When you use summary, Spark processes the data across the cluster, crunching the stats in parallel to keep it fast, even with massive datasets. It’s an action—running immediately when called—and it’s built into the Spark SQL engine, leveraging the Catalyst optimizer to ensure efficiency. You’ll find it coming up whenever you need a deeper, tailored look at your data’s numerical properties—whether you’re exploring trends, validating distributions, or digging into specific stats—offering a versatile step up from the fixed output of describe.
Here’s a quick look at how it works:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("QuickLook").getOrCreate()
data = [("Alice", 25, 160.5), ("Bob", 30, 175.0)]
df = spark.createDataFrame(data, ["name", "age", "height"])
summary = df.summary()
summary.show()
# Output:
# +-------+----+------------------+-----------------+
# |summary|name| age| height|
# +-------+----+------------------+-----------------+
# | count| 2| 2| 2|
# | mean|null| 27.5| 167.75|
# | stddev|null|3.5355339059327378|10.25304832720494|
# | min|Alice| 25.0| 160.5|
# | 25%| null| 25.0| 160.5|
# | 50%| null| 27.5| 167.75|
# | 75%| null| 30.0| 175.0|
# | max| Bob| 30.0| 175.0|
# +-------+----+------------------+-----------------+
spark.stop()
We start with a SparkSession, create a DataFrame with names, ages, and heights, and call summary with no args. Spark computes a default set—count, mean, stddev, min, 25%, 50%, 75%, max—showing stats for the numeric columns, while the string column "name" only gets count, min, and max. Want more on DataFrames? See DataFrames in PySpark. For setup help, check Installing PySpark.
The Statistics Parameter
When you use summary, you can optionally pass statistics as variable-length arguments (*statistics) to customize the output. Here’s how it works:
- **Statistics (*statistics)**: Strings—like "count", "mean", "50%"—naming the stats to compute. Options include "count", "mean", "stddev", "min", "max", and percentiles (e.g., "25%", "50%", "75%"). If you skip them, you get a default set: count, mean, stddev, min, 25%, 50%, 75%, max. Pick what you need—Spark computes only those, for every column, though non-numeric columns only report count, min, and max (the rest come back null).
Here’s an example with and without stats:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StatsPeek").getOrCreate()
data = [("Alice", 25, 160.5)]
df = spark.createDataFrame(data, ["name", "age", "height"])
default_summary = df.summary()
default_summary.show()
custom_summary = df.summary("count", "mean", "50%")
custom_summary.show()
# Output (default):
# +-------+----+------------------+-----------------+
# |summary|name| age| height|
# +-------+----+------------------+-----------------+
# | count| 1| 1| 1|
# | mean|null| 25.0| 160.5|
# | stddev|null| 0.0| 0.0|
# | min|Alice| 25.0| 160.5|
# | 25%| null| 25.0| 160.5|
# | 50%| null| 25.0| 160.5|
# | 75%| null| 25.0| 160.5|
# | max|Alice| 25.0| 160.5|
# +-------+----+------------------+-----------------+
# Output (custom):
# +-------+----+------------------+-----------------+
# |summary|name| age| height|
# +-------+----+------------------+-----------------+
# | count| 1| 1| 1|
# | mean|null| 25.0| 160.5|
# | 50%| null| 25.0| 160.5|
# +-------+----+------------------+-----------------+
spark.stop()
We run it with the defaults, then pick "count", "mean", "50%"—tailored stats, less clutter. With a single row the spread statistics aren't very meaningful, but the output stays focused.
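One caveat worth knowing: summary only accepts a fixed set of statistic names (plus arbitrary percentile strings like "10%"), and an unrecognized name raises an error rather than being silently ignored. Here's a minimal sketch of guarding against that; the made-up "midpoint" statistic and the broad except are just for illustration, since the exact exception type can vary by Spark version.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StatGuard").getOrCreate()
df = spark.createDataFrame([("Alice", 25, 160.5)], ["name", "age", "height"])
try:
    # "midpoint" is not a recognized statistic name, so Spark raises an error
    df.summary("count", "midpoint").show()
except Exception as err:
    # Depending on the version this surfaces as an IllegalArgumentException or AnalysisException
    print(f"Unsupported statistic: {err}")
    df.summary("count", "mean", "50%").show()
spark.stop()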
Various Ways to Use Summary in PySpark
The summary operation offers several natural ways to summarize your DataFrame’s numerical data, each fitting into different scenarios. Let’s explore them with examples that show how it all plays out.
1. Getting a Broad Data Snapshot
When you want a full picture of your DataFrame’s numbers—like ranges, averages, and percentiles—summary with no args computes a default set of stats for all numeric columns, giving you a comprehensive overview. It’s a quick way to see the lay of the land.
This is perfect for initial exploration—say, scoping out user metrics. You get a detailed snapshot without picking specifics.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BroadSnap").getOrCreate()
data = [("Alice", 25, 160.5), ("Bob", 30, 175.0), ("Cathy", 22, 155.0)]
df = spark.createDataFrame(data, ["name", "age", "height"])
summary = df.summary()
summary.show()
# Output:
# +-------+----+------------------+------------------+
# |summary|name| age| height|
# +-------+----+------------------+------------------+
# | count| 3| 3| 3|
# | mean|null|25.666666666666668| 163.5|
# | stddev|null|4.0414518843273815| 10.33198915988591|
# | min|Alice| 22.0| 155.0|
# | 25%| null| 22.0| 155.0|
# | 50%| null| 25.0| 160.5|
# | 75%| null| 30.0| 175.0|
# | max|Cathy| 30.0| 175.0|
# +-------+----+------------------+------------------+
spark.stop()
We run summary on the whole DataFrame—ages average about 25.7, heights 163.5, with quartiles showing the spread. If you’re exploring customer data, this paints the full picture.
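The default snapshot covers every column, and string columns only contribute count, min, and max. If you'd rather keep the report strictly numeric, you can filter the columns first with dtypes. Here's a small sketch of that idea; the list of type names checked below is illustrative, so extend it for your own schemas.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NumericOnly").getOrCreate()
data = [("Alice", 25, 160.5), ("Bob", 30, 175.0), ("Cathy", 22, 155.0)]
df = spark.createDataFrame(data, ["name", "age", "height"])
# Keep only columns whose dtype looks numeric before summarizing
numeric_types = ("int", "bigint", "smallint", "tinyint", "float", "double")
numeric_cols = [c for c, t in df.dtypes if t in numeric_types or t.startswith("decimal")]
df.select(numeric_cols).summary().show()  # stats for age and height only
spark.stop()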
2. Focusing on Key Statistics
When you only need certain stats—like count and median—summary lets you pick them, tailoring the output to your focus. It’s a way to zoom in on what matters without extra noise.
This comes up when analyzing specifics—maybe just checking counts and medians for sales. You name the stats, and it delivers just those.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("KeyStats").getOrCreate()
data = [("Alice", 25, 160.5), ("Bob", 30, 175.0)]
df = spark.createDataFrame(data, ["name", "age", "height"])
summary = df.summary("count", "50%")
summary.show()
# Output:
# +-------+----+------------------+-----------------+
# |summary|name| age| height|
# +-------+----+------------------+-----------------+
# | count| 2| 2| 2|
# | 50%| null| 27.5| 167.75|
# +-------+----+------------------+-----------------+
spark.stop()
We pick "count" and "50%" (median)—focused stats for "age" and "height". If you’re sizing up user medians, this keeps it tight.
3. Checking Data Distribution
When you need to see how data spreads—like quartiles or extremes—summary with percentiles shows the distribution, helping you understand variability and skew. It’s a way to map your data’s range.
This fits when validating—maybe checking income spread for outliers. You call out percentiles and get the full lay.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DistCheck").getOrCreate()
data = [("Alice", 25, 160.5), ("Bob", 30, 175.0), ("Cathy", 22, 155.0)]
df = spark.createDataFrame(data, ["name", "age", "height"])
summary = df.summary("25%", "50%", "75%")
summary.show()
# Output:
# +-------+----+------------------+------------------+
# |summary|name| age| height|
# +-------+----+------------------+------------------+
# | 25%| null| 22.0| 155.0|
# | 50%| null| 25.0| 160.5|
# | 75%| null| 30.0| 175.0|
# +-------+----+------------------+------------------+
spark.stop()
We grab quartiles—age spreads from 22 to 30, height from 155 to 175. If you’re checking user height distribution, this nails it.
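The percentiles summary reports are approximate, which is part of what keeps it cheap on big data. If you need tighter control over the approximation error for a single column, approxQuantile takes an explicit relative-error argument. A quick sketch follows; passing 0.0 requests exact quantiles, which costs more on large datasets.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("QuantileControl").getOrCreate()
data = [("Alice", 25, 160.5), ("Bob", 30, 175.0), ("Cathy", 22, 155.0)]
df = spark.createDataFrame(data, ["name", "age", "height"])
# approxQuantile(column, probabilities, relativeError)
quartiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.0)
print(quartiles)  # roughly [22.0, 25.0, 30.0]
spark.stop()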
4. Prepping for Modeling with Stats
When you’re prepping data for modeling—like needing means and spreads—summary gives you custom stats to baseline features, guiding scaling or outlier handling. It’s a way to set your modeling stage.
This is key for machine learning—maybe sizing up features for normalization. You pick stats and get a tailored view.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ModelStats").getOrCreate()
data = [("Alice", 25, 160.5), ("Bob", 30, 175.0), ("Cathy", 22, 155.0)]
df = spark.createDataFrame(data, ["name", "age", "height"])
summary = df.select("age", "height").summary("mean", "stddev", "min", "max")
summary.show()
# Output:
# +-------+------------------+------------------+
# |summary| age| height|
# +-------+------------------+------------------+
# | mean|25.666666666666668| 163.5|
# | stddev|4.0414518843273815| 10.33198915988591|
# | min| 22.0| 155.0|
# | max| 30.0| 175.0|
# +-------+------------------+------------------+
spark.stop()
We select the numeric columns, then pick mean, stddev, min, max—stats for scaling prep. If you’re modeling user profiles, this sets your feature bounds.
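Because summary returns an ordinary DataFrame (with its values stored as strings), you can pull numbers out of it and feed them straight into scaling. Here's a small sketch that standardizes the age column using the mean and stddev from the summary result; the column names and the simple (x - mean) / stddev scaling are just illustrative.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("ScalePrep").getOrCreate()
data = [("Alice", 25, 160.5), ("Bob", 30, 175.0), ("Cathy", 22, 155.0)]
df = spark.createDataFrame(data, ["name", "age", "height"])
stats = df.select("age").summary("mean", "stddev")
# summary stores its values as strings, so cast them before doing math
age_mean = float(stats.filter(F.col("summary") == "mean").first()["age"])
age_std = float(stats.filter(F.col("summary") == "stddev").first()["age"])
df.withColumn("age_scaled", (F.col("age") - age_mean) / age_std).show()
spark.stop()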
5. Comparing Data Subsets
When you’re comparing groups—like filtered slices—summary runs on each with custom stats, showing how they differ in spread or central tendency. It’s a way to contrast your data.
This fits when analyzing—maybe comparing sales by region. You filter, summarize, and spot shifts.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SubsetCompare").getOrCreate()
data = [("Alice", "HR", 25), ("Bob", "IT", 30), ("Cathy", "HR", 22)]
df = spark.createDataFrame(data, ["name", "dept", "age"])
hr_df = df.filter(df.dept == "HR")
it_df = df.filter(df.dept == "IT")
hr_summary = hr_df.select("age").summary("mean", "50%")
it_summary = it_df.select("age").summary("mean", "50%")
hr_summary.show()
it_summary.show()
# Output (HR):
# +-------+----+
# |summary| age|
# +-------+----+
# | mean|23.5|
# | 50%|23.5|
# +-------+----+
# Output (IT):
# +-------+----+
# |summary| age|
# +-------+----+
# | mean|30.0|
# | 50%|30.0|
# +-------+----+
spark.stop()
We split by dept, select the age column, and summarize mean and median—HR skews younger than IT. If you’re comparing team ages, this highlights the gap.
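Running summary on each filtered slice works, but with many groups it's often handier to compute the same stats in one pass with groupBy and agg. Here's a sketch using the SQL percentile_approx function through expr for the median; it's an alternative to summary rather than part of it.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("GroupCompare").getOrCreate()
data = [("Alice", "HR", 25), ("Bob", "IT", 30), ("Cathy", "HR", 22)]
df = spark.createDataFrame(data, ["name", "dept", "age"])
# One aggregation pass instead of one summary() call per filtered subset
df.groupBy("dept").agg(
    F.mean("age").alias("mean_age"),
    F.expr("percentile_approx(age, 0.5)").alias("median_age")
).show()
spark.stop()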
Common Use Cases of the Summary Operation
The summary operation fits into moments where tailored stats matter. Here’s where it naturally comes up.
1. Data Overview
For a broad look, summary gives default stats.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataView").getOrCreate()
df = spark.createDataFrame([(25, 160.5)], ["age", "height"])
df.summary().show()
# Output: +-------+----+-----------------+
# |summary| age| height|
# +-------+----+-----------------+
# | count| 1| 1|
# | mean|25.0| 160.5|
# | stddev| 0.0| 0.0|
# | min|25.0| 160.5|
# | 25%|25.0| 160.5|
# | 50%|25.0| 160.5|
# | 75%|25.0| 160.5|
# | max|25.0| 160.5|
# +-------+----+-----------------+
spark.stop()
2. Key Stats Focus
For specific stats, summary tailors it.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("KeyFocus").getOrCreate()
df = spark.createDataFrame([(25, 160.5)], ["age", "height"])
df.summary("count", "mean").show()
# Output: +-------+----+-----------------+
# |summary| age| height|
# +-------+----+-----------------+
# | count| 1| 1|
# | mean|25.0| 160.5|
# +-------+----+-----------------+
spark.stop()
3. Distribution Check
To see spread, summary shows percentiles.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Dist").getOrCreate()
df = spark.createDataFrame([(25, 160.5)], ["age", "height"])
df.summary("25%", "50%", "75%").show()
# Output: +-------+----+-----------------+
# |summary| age| height|
# +-------+----+-----------------+
# | 25%|25.0| 160.5|
# | 50%|25.0| 160.5|
# | 75%|25.0| 160.5|
# +-------+----+-----------------+
spark.stop()
4. Model Prep
For modeling, summary baselines stats.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Model").getOrCreate()
df = spark.createDataFrame([(25, 160.5)], ["age", "height"])
df.summary("mean", "stddev").show()
# Output: +-------+----+-----------------+
# |summary| age| height|
# +-------+----+-----------------+
# | mean|25.0| 160.5|
# | stddev| 0.0| 0.0|
# +-------+----+-----------------+
spark.stop()
FAQ: Answers to Common Summary Questions
Here’s a natural rundown on summary questions, with deep, clear answers.
Q: How’s it different from describe?
Summary is flexible—it takes args like "50%" or "25%" for custom percentiles alongside "count", "mean", "stddev", "min", and "max", defaulting to count, mean, stddev, min, 25%, 50%, 75%, max. Describe is fixed—count, mean, stddev, min, max only. Summary tailors; describe is quick and set.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SummVsDesc").getOrCreate()
df = spark.createDataFrame([(25, 160.5)], ["age", "height"])
df.summary().show()
df.describe().show()
# Output (summary):
# +-------+----+-----------------+
# |summary| age| height|
# +-------+----+-----------------+
# | count| 1| 1|
# | mean|25.0| 160.5|
# | stddev| 0.0| 0.0|
# | min|25.0| 160.5|
# | 25%|25.0| 160.5|
# | 50%|25.0| 160.5|
# | 75%|25.0| 160.5|
# | max|25.0| 160.5|
# +-------+----+-----------------+
# Output (describe):
# +-------+------------------+-----------------+
# |summary| age| height|
# +-------+------------------+-----------------+
# | count| 1| 1|
# | mean| 25.0| 160.5|
# | stddev| null| null|
# | min| 25| 160.5|
# | max| 25| 160.5|
# +-------+------------------+-----------------+
spark.stop()
Q: Does it work on strings?
Yes, but limited—only "count", "min", "max" for non-numeric columns (e.g., strings). Numeric stats like mean or percentiles need numbers—strings get null there.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("StringSumm").getOrCreate()
df = spark.createDataFrame([("Alice", 25)], ["name", "age"])
df.summary().show()
# Output: +-------+----+------------------+
# |summary|name| age|
# +-------+----+------------------+
# | count| 1| 1|
# | mean|null| 25.0|
# | stddev|null| 0.0|
# | min|Alice| 25.0|
# | 25%| null| 25.0|
# | 50%| null| 25.0|
# | 75%| null| 25.0|
# | max|Alice| 25.0|
# +-------+----+------------------+
spark.stop()
Q: How’s performance with big data?
Summary scales well—it's an action that computes the stats in parallel across the cluster, and the percentiles it reports are approximate rather than exact, which keeps them cheap even on huge datasets. Count, mean, stddev, min, and max are simple distributed aggregations, and the Catalyst optimizer keeps the whole job efficient.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("BigPerf").getOrCreate()
df = spark.createDataFrame([(i, float(i)) for i in range(10000)], ["age", "height"])
df.summary().show()
# Output: Stats for 10000 rows, fast
spark.stop()
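Since summary is an action, each call re-runs the DataFrame's lineage from the source. If you plan to call summary alongside other actions on the same expensive DataFrame, caching it first avoids recomputing that work. Here's a sketch of the pattern; cache and unpersist are standard DataFrame methods.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("CacheThenSummarize").getOrCreate()
df = spark.createDataFrame([(i, float(i)) for i in range(10000)], ["age", "height"])
df.cache()  # mark for caching; the first action below materializes it
df.summary("mean", "stddev").show()
df.summary("25%", "75%").show()  # reads from the cache instead of rebuilding the data
df.unpersist()  # release the cached data when you're done
spark.stop()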
Q: Can I pick any percentile?
Yes—any valid percentile (e.g., "10%", "99%") works with summary. Pass them as strings—Spark computes what you ask, keeping it flexible.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AnyPerc").getOrCreate()
df = spark.createDataFrame([(25, 160.5), (30, 175.0)], ["age", "height"])
df.summary("10%", "90%").show()
# Output: +-------+----+------------------+
# |summary| age| height|
# +-------+----+------------------+
# | 10%|25.0| 160.5|
# | 90%|30.0| 175.0|
# +-------+----+------------------+
spark.stop()
Q: What if data’s all null?
Summary runs—count is 0, the rest null. It handles empty or null-heavy data gracefully, showing what's there (or not). Just note that when every value is null you need an explicit schema to build the DataFrame, since Spark can't infer the column types.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
spark = SparkSession.builder.appName("AllNull").getOrCreate()
# An explicit schema is needed: Spark can't infer column types when every value is null
schema = StructType([StructField("age", IntegerType()), StructField("height", DoubleType())])
df = spark.createDataFrame([(None, None)], schema)
df.summary().show()
# Output: +-------+----+------+
# |summary| age|height|
# +-------+----+------+
# | count| 0| 0|
# | mean|null| null|
# | stddev|null| null|
# | min|null| null|
# | 25%| null| null|
# | 50%| null| null|
# | 75%| null| null|
# | max|null| null|
# +-------+----+------+
spark.stop()
Summary vs Other DataFrame Operations
The summary operation tailors numeric stats, unlike describe (fixed stats) or toJSON (JSON output). It’s not about raw data like rdd or views like createTempView—it’s a flexible stats tool, managed by Spark’s Catalyst engine, distinct from ops like show.
More details at DataFrame Operations.
Conclusion
The summary operation in PySpark is a versatile, customizable way to summarize your DataFrame’s numbers, delivering tailored insights with a simple call. Master it with PySpark Fundamentals to elevate your data skills!