
I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a list of addresses that I want to normalize. For example, this dataframe:

id     address
1       2 foo lane
2       10 bar lane
3       24 pants ln

Would become

id     address
1       2 foo ln
2       10 bar ln
3       24 pants ln
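
For anyone who wants to reproduce this, a minimal sketch that builds the example (assuming an active SparkSession named spark):

df = spark.createDataFrame(
    [(1, "2 foo lane"), (2, "10 bar lane"), (3, "24 pants ln")],
    ["id", "address"],
)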
    What's your Spark version? Commented May 4, 2016 at 21:19

4 Answers


For Spark 1.5 or later, you can use the functions package:

from pyspark.sql.functions import regexp_replace
newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))

Quick explanation:

  • The function withColumn is called to add (or replace, if the name already exists) a column to the DataFrame.
  • The function regexp_replace generates a new column by replacing all substrings that match the pattern; a short demonstration on the example data follows.
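
Assuming the example dataframe from the question is named df, the result looks like this:

newDf.show()
# +--+-----------+
# |id|    address|
# +--+-----------+
# | 1|   2 foo ln|
# | 2|  10 bar ln|
# | 3|24 pants ln|
# +--+-----------+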

6 Comments

Just remember that the first parameter of regexp_replace refers to the column being changed, the second is the regex to find and the last is how to replace it.
can I use regexp_replace inside a pipeline? Thanks
Can we change more than one item in this code?
@elham you can change any value that matches a regular expression for one column using this function: spark.apache.org/docs/2.2.0/api/R/regexp_replace.html
Can this be adapted to replace only if entire string is matched and not substring? i.e., if I wanted to replace 'lane' by 'ln' but keep 'skylane' unchanged?
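
For the whole-word case, one option (a sketch, not from the original thread) is to anchor the pattern with word boundaries so that 'skylane' is left untouched:

from pyspark.sql.functions import regexp_replace
# \blane\b only matches 'lane' as a standalone word, so 'skylane' survives
newDf = df.withColumn('address', regexp_replace('address', r'\blane\b', 'ln'))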

For Scala:

import org.apache.spark.sql.functions.{col, regexp_replace}

// "\\*" is a regex-escaped asterisk, so this strips literal '*' characters
data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))



My suggestion is to import regexp_replace from the SQL functions package and use withColumn to modify the existing column in the DataFrame. In this case we replace 'lane' with 'ln' in the address column.

from pyspark.sql.functions import regexp_replace

replacedf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))



In Spark 3.5 they introduced the replace function, which accepts Column arguments and is quite efficient.

Works like this:

from pyspark.sql.functions import replace

df = spark.createDataFrame([("ABCabc", "abc", "DEF",)], ["a", "b", "c"])
df.select(replace(df.a, df.b, df.c).alias('r')).show()
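
If the search and replacement values are constants rather than columns, they can be wrapped in lit, since plain strings would be interpreted as column names. A minimal sketch reusing the frame above:

from pyspark.sql.functions import lit, replace
# replace performs literal (non-regex) substring replacement
df.select(replace(df.a, lit("abc"), lit("DEF")).alias("r")).show()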

