
I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example, this dataframe:

id     address
1       2 foo lane
2       10 bar lane
3       24 pants ln

Would become

id     address
1       2 foo ln
2       10 bar ln
3       24 pants ln
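
For anyone who wants to reproduce this, a minimal sketch that builds the example (assuming an active SparkSession named spark):

df = spark.createDataFrame(
    [(1, "2 foo lane"), (2, "10 bar lane"), (3, "24 pants ln")],
    ["id", "address"],
)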
    What's your Spark version? Commented May 4, 2016 at 21:19

4 Answers


For Spark 1.5 or later, you can use the functions package:

from pyspark.sql.functions import regexp_replace
newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Quick explanation:

  • The function withColumn is called to add (or replace, if the name already exists) a column to the DataFrame.
  • The function regexp_replace generates a new column by replacing all substrings that match the pattern; a short demonstration on the example data follows.
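
Assuming the example dataframe from the question is named df, the result looks like this:

newDf.show()
# +--+-----------+
# |id|    address|
# +--+-----------+
# | 1|   2 foo ln|
# | 2|  10 bar ln|
# | 3|24 pants ln|
# +--+-----------+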

6 Comments

Just remember that the first parameter of regexp_replace refers to the column being changed, the second is the regex to find and the last is how to replace it.
can I use regexp_replace inside a pipeline? Thanks
Can we change more than one item in this code?
@elham you can change any value that matches a regular expression for one column using this function: spark.apache.org/docs/2.2.0/api/R/regexp_replace.html
Can this be adapted to replace only if entire string is matched and not substring? i.e., if I wanted to replace 'lane' by 'ln' but keep 'skylane' unchanged?
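
For the whole-word case, one option (a sketch, not from the original thread) is to anchor the pattern with word boundaries so that 'skylane' is left untouched:

from pyspark.sql.functions import regexp_replace
# \blane\b only matches 'lane' as a standalone word, so 'skylane' survives
newDf = df.withColumn('address', regexp_replace('address', r'\blane\b', 'ln'))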

For Scala:

import org.apache.spark.sql.functions.{col, regexp_replace}

// "\\*" is a regex-escaped asterisk, so this strips literal '*' characters
data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))



My suggestion is to import regexp_replace from the SQL functions package and use withColumn to modify the existing column in the DataFrame. In this case we replace 'lane' with 'ln' in the address column.

from pyspark.sql.functions import regexp_replace

replacedf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))



In Spark 3.5 they introduced the replace function, which accepts Column arguments and is quite efficient.

Works like this:

from pyspark.sql.functions import replace

df = spark.createDataFrame([("ABCabc", "abc", "DEF",)], ["a", "b", "c"])
df.select(replace(df.a, df.b, df.c).alias('r')).show()
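
If the search and replacement values are constants rather than columns, they can be wrapped in lit, since plain strings would be interpreted as column names. A minimal sketch reusing the frame above:

from pyspark.sql.functions import lit, replace
# replace performs literal (non-regex) substring replacement
df.select(replace(df.a, lit("abc"), lit("DEF")).alias("r")).show()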

