Converting Pandas DataFrames to Parquet Format: A Comprehensive Guide

Introduction

Apache Parquet is an open-source, columnar file format that originated in the Hadoop ecosystem. It is designed for efficient, high-performance storage of tabular data and typically compresses and queries far better than row-based formats such as CSV or TSV. It also handles complex data types and nested structures well. Pandas, one of the most popular data manipulation libraries in Python, provides an easy-to-use method for converting DataFrames to and from Parquet.

Why Choose Parquet?

  • Columnar Storage: Instead of storing data row by row, Parquet stores it column by column, which compresses well and saves storage space.
  • Schema Evolution: Parquet supports schema evolution, so you can add new columns or drop existing ones over time.
  • Performance: It is heavily optimized for complex nested data structures and provides fast data retrieval.
  • Compatibility: It works well with a variety of data processing tools such as Apache Spark, Apache Hive, Apache Impala, and Apache Arrow.

Installing Required Libraries

Before converting a DataFrame to Parquet, ensure that you have installed pandas along with either pyarrow or fastparquet, since pandas requires one of these engines to handle Parquet files:

pip install pandas pyarrow 
# or 
pip install pandas fastparquet 
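
If both engines are installed, pandas uses pyarrow by default (the engine='auto' setting tries pyarrow first and falls back to fastparquet). A quick way to confirm that an engine is importable after installation:

python -c "import pyarrow; print(pyarrow.__version__)" 
# or 
python -c "import fastparquet; print(fastparquet.__version__)" 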

Basic Conversion

Converting a DataFrame to a Parquet file is straightforward. Here is how you can do it:

import pandas as pd 
    
# Creating a sample DataFrame 
data = {'Name': ['Tom', 'Jack', 'Steve', 'Ricky'], 
        'Age': [28, 34, 29, 42], 
        'Address': ['New York', 'Toronto', 'San Francisco', 'Seattle'], 
        'Qualification': ['MBA', 'BCA', 'M.Tech', 'MBA']} 
df = pd.DataFrame(data) 

# Convert DataFrame to Parquet 
df.to_parquet('output.parquet') 
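
Two commonly used to_parquet options, continuing with the df defined above: engine lets you pick pyarrow or fastparquet explicitly, and index=False omits the DataFrame index from the file. A minimal sketch (the output file name is just an example):

# Pick the engine explicitly and skip writing the index 
df.to_parquet('output_no_index.parquet', engine='pyarrow', index=False) 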

Reading Parquet Files

You can also read a Parquet file back into a DataFrame with pd.read_parquet:

df = pd.read_parquet('output.parquet') 
print(df) 
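
Because Parquet is columnar, you can read just the columns you need instead of the whole file. A small sketch using the file written above:

# Load only the 'Name' and 'Age' columns from the file 
df_subset = pd.read_parquet('output.parquet', columns=['Name', 'Age']) 
print(df_subset) 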

Specifying Compression

Parquet supports various compression algorithms. You can specify the compression type using the compression parameter:

df.to_parquet('output.parquet', compression='gzip') 
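
With the pyarrow engine, 'snappy' is the default codec; other values such as 'brotli' or None (no compression) are also accepted, though exact availability depends on the engine and how it was built. A quick sketch of the trade-offs:

# Snappy is fast and the default; brotli usually yields smaller files; None disables compression 
df.to_parquet('output_snappy.parquet', compression='snappy') 
df.to_parquet('output_brotli.parquet', compression='brotli') 
df.to_parquet('output_uncompressed.parquet', compression=None) 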

Working with S3 Buckets

You can read and write DataFrames directly from/to an S3 bucket if the required libraries are installed:

df.to_parquet('s3://mybucket/output.parquet') 
df = pd.read_parquet('s3://mybucket/output.parquet') 

Make sure you have the s3fs library installed and configured:

pip install s3fs 
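
If your credentials are not picked up from the environment or your AWS configuration, you can pass them through the storage_options parameter, which pandas forwards to s3fs. The bucket name and keys below are placeholders:

# Pass credentials explicitly via storage_options (placeholder bucket and keys) 
df.to_parquet( 
    's3://mybucket/output.parquet', 
    storage_options={'key': 'YOUR_ACCESS_KEY', 'secret': 'YOUR_SECRET_KEY'}, 
) 
df = pd.read_parquet( 
    's3://mybucket/output.parquet', 
    storage_options={'key': 'YOUR_ACCESS_KEY', 'secret': 'YOUR_SECRET_KEY'}, 
) 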

Partitioning

Parquet supports partitioning data based on column values. When you pass partition_cols, pandas writes a directory tree rather than a single file, with one subdirectory per unique value of each partition column:

df.to_parquet('output_dir', partition_cols=['Name']) 

Each unique value in the "Name" column produces its own subdirectory (for example Name=Tom/) containing the Parquet data for that value.
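
You can read the whole partitioned directory back with pd.read_parquet, and with the pyarrow engine you can push a filter down so that only matching partitions are loaded. The filters keyword is handed to pyarrow, so treat this as a sketch that assumes pyarrow is the engine in use:

# Read the entire partitioned dataset back into a single DataFrame 
df_all = pd.read_parquet('output_dir') 

# Load only the partitions where Name == 'Tom' (pyarrow engine) 
df_tom = pd.read_parquet('output_dir', filters=[('Name', '==', 'Tom')]) 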

Conclusion

Storing your data in Parquet format can lead to significant improvements in both storage space and query performance. With the simple and well-documented pandas interface, converting your data to this efficient format is hassle-free.

The ability to read from and write to various sources like local file systems, distributed file systems, and cloud storage, as well as support for different compression algorithms, makes pandas and Parquet a powerful combination for handling large datasets.