How to do String Manipulation in Spark

Apache Spark is a powerful distributed computing framework that can process large amounts of data in a fast and scalable manner. One common task in data processing is string manipulation. In this blog, we will discuss some of the ways to perform string manipulation in Spark.

Concatenation

Concatenation is the process of joining two or more strings together. In Spark, we can use the concat function to concatenate two or more string columns. Here's an example:

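import org.apache.spark.sql.functions._
import spark.implicits._ // enables toDF; assumes a SparkSession named spark (as in spark-shell)

// Build a small DataFrame of names
val df = Seq((1, "John", "Doe"),
    (2, "Jane", "Doe"))
    .toDF("id", "first_name", "last_name")

// Join first_name and last_name with no separator
df.select(concat(col("first_name"), col("last_name")).alias("full_name")).show()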

Output:

+---------+
|full_name|
+---------+
|  JohnDoe|
|  JaneDoe|
+---------+
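
Since concat adds no separator, a readable full name usually needs an explicit space. The following is a minimal sketch, assuming a standalone script with a local SparkSession; the app name and the extra row with a missing last name are illustrative and not part of the example above. It contrasts concat with concat_ws, which takes the separator as its first argument and skips null inputs, whereas concat returns null if any input is null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, concat_ws}

// Assumption: a local session for illustration; in spark-shell, reuse the existing `spark`
val session = SparkSession.builder().master("local[*]").appName("concat-sketch").getOrCreate()
import session.implicits._

// Illustrative data: the third row has no last name
val rows: Seq[(String, Option[String])] =
  Seq(("John", Some("Doe")), ("Jane", Some("Doe")), ("Ann", None))
val people = rows.toDF("first_name", "last_name")

val named = people.select(
  // concat_ws inserts the separator and skips null inputs
  concat_ws(" ", col("first_name"), col("last_name")).alias("with_ws"),
  // concat yields null as soon as one input is null
  concat(col("first_name"), col("last_name")).alias("with_concat"))

named.show()
```

Here the with_ws column holds "John Doe", "Jane Doe" and "Ann", while with_concat is null for the row without a last name.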

Substring

A substring is a contiguous portion of a string. In Spark, we can use the substring function, which takes a 1-based start position and a length, to extract one. Here's an example:

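import org.apache.spark.sql.functions._
import spark.implicits._ // enables toDF; assumes a SparkSession named spark (as in spark-shell)

val df = Seq((1, "John", "Doe"),
    (2, "Jane", "Doe"))
    .toDF("id", "first_name", "last_name")

// Take 2 characters starting at position 1 (positions are 1-based)
df.select(substring(col("first_name"), 1, 2).alias("initials")).show()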

Output:

+--------+
|initials|
+--------+
|      Jo|
|      Ja|
+--------+
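
The same extraction is also available as a method on the column itself. The following is a minimal sketch, assuming a standalone script with a local SparkSession (the app name and column aliases are illustrative). It shows Column.substr as an equivalent of substring, and a negative start position, which counts from the end of the string.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring}

// Assumption: a local session for illustration; in spark-shell, reuse the existing `spark`
val session = SparkSession.builder().master("local[*]").appName("substring-sketch").getOrCreate()
import session.implicits._

val names = Seq("John", "Jane").toDF("first_name")

val parts = names.select(
  // Method form: 2 characters starting at 1-based position 1
  col("first_name").substr(1, 2).alias("first_two"),
  // A negative start counts back from the end of the string
  substring(col("first_name"), -2, 2).alias("last_two"))

parts.show()
```

For "John" this yields "Jo" and "hn"; for "Jane" it yields "Ja" and "ne".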

Trim

Trimming removes leading and trailing spaces from a string. In Spark, we can use the trim function to do this. Here's an example:

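import org.apache.spark.sql.functions._
import spark.implicits._ // enables toDF; assumes a SparkSession named spark (as in spark-shell)

// Names padded with stray leading and trailing spaces
val df = Seq((1, " John ", " Doe "),
    (2, " Jane", "Doe "))
    .toDF("id", "first_name", "last_name")

// Strip the spaces from both ends of each column
df.select(trim(col("first_name")).alias("first_name"), trim(col("last_name")).alias("last_name")).show()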

Output:

+----------+---------+
|first_name|last_name|
+----------+---------+
|      John|       Doe|
|      Jane|       Doe|
+----------+---------+
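
When only one side should be cleaned, Spark also provides ltrim and rtrim. The following is a minimal sketch, assuming a standalone script with a local SparkSession (the app name and column aliases are illustrative). Because show() pads its output, the effect is easier to see through string lengths than through the printed values.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, ltrim, rtrim, trim}

// Assumption: a local session for illustration; in spark-shell, reuse the existing `spark`
val session = SparkSession.builder().master("local[*]").appName("trim-sketch").getOrCreate()
import session.implicits._

// One leading and one trailing space around a 4-character name
val padded = Seq(" John ").toDF("first_name")

val lengths = padded.select(
  length(col("first_name")).alias("raw"),           // untouched: 6 characters
  length(ltrim(col("first_name"))).alias("ltrim"),  // leading space removed: 5
  length(rtrim(col("first_name"))).alias("rtrim"),  // trailing space removed: 5
  length(trim(col("first_name"))).alias("trim"))    // both removed: 4

lengths.show()
```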

Replace

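Replacing swaps each occurrence of a substring with another string. In Spark, we can use the regexp_replace function, which treats the search pattern as a Java regular expression; for plain text without regex metacharacters it behaves like a literal replace. Here's an example:

import org.apache.spark.sql.functions._
import spark.implicits._ // enables toDF; assumes a SparkSession named spark (as in spark-shell)

val df = Seq((1, "John", "Doe"),
    (2, "Jane", "Doe"))
    .toDF("id", "first_name", "last_name")

// Replace "Jo" with "Ja" in first_name and "o" with "a" in last_name
df.select(regexp_replace(col("first_name"), "Jo", "Ja").alias("first_name"),
    regexp_replace(col("last_name"), "o", "a").alias("last_name")).show()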

Output:

+----------+---------+
|first_name|last_name|
+----------+---------+
|      Jahn|       Dae|
|      Jane|       Dae|
+----------+---------+
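
One caveat when replacing with regexp_replace: the search pattern is a regular expression, so characters such as "." have a special meaning unless they are escaped. The following is a minimal sketch, assuming a standalone script with a local SparkSession; the app name and the version column with its value are hypothetical and chosen only to illustrate the difference.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a local session for illustration; in spark-shell, reuse the existing `spark`
val session = SparkSession.builder().master("local[*]").appName("replace-sketch").getOrCreate()
import session.implicits._

// Hypothetical data containing a regex metacharacter
val versions = Seq("1.5.0").toDF("version")

val replaced = versions.select(
  // Unescaped "." matches every character, so the whole string is replaced
  regexp_replace(col("version"), ".", "-").alias("unescaped"),
  // Escaped "\\." matches only literal dots
  regexp_replace(col("version"), "\\.", "-").alias("escaped"))

replaced.show()
```

The unescaped column comes out as "-----", while the escaped column is "1-5-0".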

Conclusion

String manipulation is a common task in data processing. In this blog, we have discussed some of the ways to perform string manipulation in Spark, including concatenation, substring, trim, and replace. These functions can be used to transform and clean string data in Spark.