How to Pivot and Unpivot Row Data in Spark
Data transformation is an essential task in the data processing pipeline, and Spark provides a variety of functions and methods to achieve it. One of the most common transformation tasks is pivoting and unpivoting data. In this post, we will discuss how to pivot and unpivot data in Spark.
Pivoting Data
Pivoting transforms a dataset from long to wide format by rotating rows into columns. In Spark, we can pivot data using the pivot() function, which is called on the result of groupBy() and followed by an aggregation. The grouping columns stay as rows, the pivot column supplies the new column names, and the aggregation (for example sum()) fills the cells from the values column. pivot() also accepts an optional list of values to pivot, which saves Spark from computing the distinct values itself.
Example
Consider the following example:
+----+-----+------+
|year|month|amount|
+----+-----+------+
|2021|  Jan|   100|
|2021|  Feb|   200|
|2021|  Mar|   300|
|2022|  Jan|   400|
|2022|  Feb|   500|
|2022|  Mar|   600|
+----+-----+------+
If we want to pivot the data so that each month becomes a row and each year becomes a column, we can use the following code:
df.groupBy("month").pivot("year").sum("amount").show()
The output will look like this (row order may vary):
+-----+----+----+
|month|2021|2022|
+-----+----+----+
|  Jan| 100| 400|
|  Feb| 200| 500|
|  Mar| 300| 600|
+-----+----+----+
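To make the reshaping concrete, here is a minimal plain-Python sketch of what a group-by, pivot, and sum computes on the sample rows. It does not use Spark; the function name pivot_sum and the tuple layout are assumptions made purely for illustration.

```python
from collections import defaultdict

# Sample rows from the input table in this section: (year, month, amount).
rows = [
    (2021, "Jan", 100), (2021, "Feb", 200), (2021, "Mar", 300),
    (2022, "Jan", 400), (2022, "Feb", 500), (2022, "Mar", 600),
]

def pivot_sum(rows):
    """Hypothetical helper: one output row per month, one column per year."""
    table = defaultdict(lambda: defaultdict(int))
    for year, month, amount in rows:
        # Amounts that land in the same (month, year) cell are summed,
        # mirroring the aggregation step of a pivot.
        table[month][year] += amount
    return {month: dict(by_year) for month, by_year in table.items()}

wide = pivot_sum(rows)
for month, by_year in wide.items():
    print(month, by_year)
```

In this sketch a (month, year) combination with no matching rows is simply absent from the result, whereas Spark fills such cells with null.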
Unpivoting Data
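Unpivoting transforms a dataset from wide to long format by rotating columns into rows. In Spark, we can unpivot data using the SQL stack() function, here combined with explode() to produce key and value columns. Spark 3.4 and later also provide a built-in DataFrame.unpivot() method.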
Example
Consider the following example:
+----+---+---+---+
|year|Jan|Feb|Mar|
+----+---+---+---+
|2021|100|200|300|
|2022|400|500|600|
+----+---+---+---+
If we want to unpivot the month columns while keeping year as the identifier column, we can use the following code:
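from pyspark.sql.functions import expr, explode
# stack() turns the three month columns into (month, amount) rows;
# explode() on a one-entry map then yields the key and value columns.
# The outer parentheses let the method chain span several lines.
(df.selectExpr("year", "stack(3, 'Jan', Jan, 'Feb', Feb, 'Mar', Mar) as (month, amount)")
   .select("year", explode(expr("map(month, amount)"))).show())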
The output will be:
+----+---+-----+
|year|key|value|
+----+---+-----+
|2021|Jan|  100|
|2021|Feb|  200|
|2021|Mar|  300|
|2022|Jan|  400|
|2022|Feb|  500|
|2022|Mar|  600|
+----+---+-----+
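As a rough mental model of the stack() step, the plain-Python sketch below turns each wide row into one (year, month, amount) row per month column. It is not Spark code; the name unpivot and the dictionary layout are assumptions made for illustration.

```python
# Wide rows from the example: one dictionary per year.
wide_rows = [
    {"year": 2021, "Jan": 100, "Feb": 200, "Mar": 300},
    {"year": 2022, "Jan": 400, "Feb": 500, "Mar": 600},
]

def unpivot(rows, id_col, value_cols):
    """Hypothetical helper: emit one (id, column name, value) tuple per value column."""
    return [
        (row[id_col], col, row[col])
        for row in rows
        for col in value_cols
    ]

long_rows = unpivot(wide_rows, "year", ["Jan", "Feb", "Mar"])
for r in long_rows:
    print(r)
```

Each input row produces as many output rows as there are value columns, which is why the two-row wide table becomes six rows in long format.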
Conclusion
Pivoting and unpivoting data are essential transformation techniques used in data processing. Spark provides various functions and methods to perform these transformations efficiently. In this blog, we discussed how to pivot and unpivot data in Spark with examples.