Deep Dive into Delta Lake on Apache Spark
Delta Lake is an open-source storage layer that brings ACID (Atomicity, Consistency, Isolation, Durability) transactions to Apache Spark, significantly simplifying data pipeline development by providing reliability and data consistency. This article guides you through the initial stages of working with Delta Lake on Apache Spark: installation, table creation, reads and writes, updates and deletes, and schema evolution.
1. Prerequisites
Before starting with Delta Lake, you need Apache Spark installed on your system. The Delta Lake release used in this article (0.8.0) requires Spark 3.0 or later. If Spark isn't already installed on your system, follow the Spark Installation Guide to get started.
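If you're not sure which Spark version you have, a quick check from a Python shell looks like this (a minimal sketch, assuming the pyspark package is importable):
import pyspark
print(pyspark.__version__)  # should print 3.0.0 or later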
2. Installing Delta Lake
Delta Lake can be added to your Spark project using the following SBT dependency:
libraryDependencies += "io.delta" %% "delta-core" % "0.8.0"
Alternatively, you can download and link the Delta Lake JAR file directly to your Spark application. Just make sure that the JAR file is in the classpath of your application.
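For the PySpark examples below, one convenient approach is to pull the package in when the session is created. This is a sketch rather than the only way to do it; it assumes you start the session from a plain Python script (in the pyspark shell, pass the package via --packages instead) and that your Spark build uses Scala 2.12:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("delta-quickstart") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
The two spark.sql.* settings enable Delta's SQL commands; they aren't strictly required for the DataFrame examples in this article.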
3. Creating a Delta Table
Delta tables are the core component of Delta Lake. They store data and are similar to the tables in a relational database. However, Delta tables provide features such as ACID transactions and time-traveling. Creating a Delta table involves writing a DataFrame in the Delta Lake format. Here's an example of creating a table using PySpark:
data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
The function range(0, 5) generates a DataFrame with a single column named id, containing the values 0 through 4. The statement write.format("delta").save("/tmp/delta-table") writes the DataFrame to a Delta table at the specified path. The table's data is saved to disk as Parquet files, and a transaction log is maintained alongside them to track the table's state.
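If you're curious what actually lands on disk, the table directory holds the Parquet data files plus a _delta_log subdirectory of JSON commit files; a quick way to peek, assuming the local path used above:
import os
print(os.listdir("/tmp/delta-table"))             # Parquet data files and _delta_log
print(os.listdir("/tmp/delta-table/_delta_log"))  # JSON commit files, e.g. 00000000000000000000.json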
4. Writing to a Delta Table
The process of writing to a Delta table is similar to writing a DataFrame, but you use the "delta" format. Here's an example of how you can write to a Delta table:
data = spark.range(5, 10)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
This code overwrites the existing Delta table with new data. The mode("overwrite") setting specifies that the existing data should be replaced. If you want to append data to the existing table instead, use mode("append").
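For instance, appending the ids 0 through 4 back alongside the existing rows looks like this:
data = spark.range(0, 5)
data.write.format("delta").mode("append").save("/tmp/delta-table")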
Delta Lake stores all the historical versions of your data. This means you can access earlier versions of the table, an amazing feature known as "Time Travel".
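Every write creates a new version of the table. You can list those versions with the DeltaTable API (introduced more fully in section 6); a short sketch:
from delta.tables import DeltaTable
DeltaTable.forPath(spark, "/tmp/delta-table").history().show()  # one row per commit/version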
5. Reading from a Delta Table
Reading from a Delta table is similar to reading from a regular Parquet table. Here's how you can do it:
df = spark.read.format("delta").load("/tmp/delta-table")
df.show()
This code reads the Delta table we wrote to earlier and displays its contents. If you want to query an earlier version of the table, you can do so with the versionAsOf option:
df = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta-table")
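You can also time travel by timestamp with the timestampAsOf option; the timestamp below is just a placeholder, so use one that falls within your table's history:
df = spark.read.format("delta").option("timestampAsOf", "2021-01-01 00:00:00").load("/tmp/delta-table")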
6. Updating and Deleting Records in a Delta Table
One of the key features of Delta Lake is the ability to modify existing data in place, something that plain Parquet files on HDFS or object storage don't support. You can modify the data in a table using UPDATE, DELETE, and MERGE INTO operations, or their equivalent DataFrame APIs. Here's an example of updating a table:
from delta.tables import DeltaTable
from pyspark.sql.functions import expr

deltaTable = DeltaTable.forPath(spark, "/tmp/delta-table")
deltaTable.update(condition = "id % 2 == 0", set = {"id": expr("id + 1")})
In this code, the update method increments the id of every even row by 1. Similarly, you can use the delete method to remove rows from the table:
deltaTable.delete(condition = "id % 2 == 0")
This command deletes the rows where id is even.
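The MERGE INTO operation mentioned above performs an upsert: rows that match a condition are updated, and rows that don't are inserted. A minimal sketch with the Python DeltaTable API, reusing the deltaTable handle and single-column table from above:
newData = spark.range(0, 20)
deltaTable.alias("t").merge(
        newData.alias("s"),
        "t.id = s.id") \
    .whenMatchedUpdate(set = {"id": "s.id"}) \
    .whenNotMatchedInsert(values = {"id": "s.id"}) \
    .execute()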
7. Schema Evolution
Delta Lake supports schema evolution, allowing you to change the schema of a table after it has been created. For example, you can add new columns simply by writing data with the mergeSchema option enabled:
data = spark.range(10,15).toDF("newColumn")
data.write.format("delta").option("mergeSchema", "true").mode("append").save("/tmp/delta-table")
In this example, a new column named newColumn is added to the existing Delta table.
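To confirm the change, read the table back and inspect its schema; rows written before the change will contain null in the new column:
spark.read.format("delta").load("/tmp/delta-table").printSchema()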
Conclusion
Delta Lake is an invaluable tool that amplifies the capabilities of Apache Spark. By providing ACID transactions, scalable metadata handling, and the unification of streaming and batch data processing, it greatly simplifies data architecture and bolsters reliability and efficiency. While this guide touches upon the basic principles of getting started with Delta Lake on Apache Spark, it is recommended that you delve deeper into the official Delta Lake documentation to explore advanced features and derive the maximum benefit from this robust tool.