Replace a substring of a string in pyspark dataframe

Question

How can I replace substrings within a string column? For example, I created a DataFrame from JSON in the following format:

line1:{"F":{"P3":"1:0.01","P8":"3:0.03,4:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"2:0.01,3:0.02","P10":"5:0.02", ...},"I":"blah"}

I need to replace the substrings "1:", "2:", "3:" with "a:", "b:", "c:", and so on. The result should be:

line1:{"F":{"P3":"a:0.01","P8":"c:0.03,d:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"b:0.01,c:0.02","P10":"e:0.02", ...},"I":"blah"}

Please note that this is just an example; the real task is substring replacement, not single-character replacement.

Any guidance in either Scala or PySpark is helpful.

Oh, sorry, I think my explanation is confusing. I only want to replace the numbers in the string after ":". Basically, P1, P2, ... Pn are keys, and I don't want to replace the keys or change their names. I only want to replace the strings in the values: "1:" to "a:", "2:" to "b:", and so on.
– Alan, Commented Aug 22, 2019 at 23:56
Like what? How is 27 translated? How is 32521 translated?
– jwvh, Commented Aug 23, 2019 at 0:13
So the whole string before ":" is replaced with a new string: "1:" to "hello_word:", "2:" to "another_hello_word", ... "27:" to "how_are_you:", "50:" to "how_am_I". Say you have a dictionary (map) from numbers to strings. I want to replace each number (a key in the dictionary) with its value, which can be any string like the examples above. The map is not limited to 27 entries; both its contents and its size can change over time.
– Alan, Commented Aug 23, 2019 at 0:18

Pushkr · Accepted Answer · 2019-08-22 22:31:06Z

3

from pyspark.sql.functions import regexp_replace

# \b keeps "1:" from also matching the tail of "11:" or "21:"
newDf = df.withColumn('col_name', regexp_replace('col_name', r'\b1:', 'a:'))

Details here: Pyspark replace strings in Spark dataframe column

edited Aug 22, 2019 at 22:31

Pushkr

answered Aug 22, 2019 at 22:04

P. Phalak

2 Comments

Alan Over a year ago

Thanks for the guidance. I am looking for something that does all the replacements at once, for example based on a map (dictionary), replacing every key with its value: "1:" to "a:", "2:" to "b:", and so on.

Alan Over a year ago

This solution is partial; the one in the link works properly, but it doesn't fully solve the question. As my question shows, the JSON file converted to a DataFrame has two columns, "I" and "F", where the datatype of "F" is struct<string, string, ...>. When I try the solution from the shared link, I get a datatype mismatch error, because it expects a string but column "F" is not a string.

jwvh · Accepted Answer · 2019-08-23 06:43:48Z

1

Let's say you have a collection of strings for possible modification (simplified for this example).

val data = Seq("1:0.01"
              ,"3:0.03,4:0.04"
              ,"2:0.01,3:0.02"
              ,"5:0.02")

And you have a dictionary of required conversions.

val num2name = Map("1" -> "A"
                  ,"2" -> "Bo"
                  ,"3" -> "Cy"
                  ,"4" -> "Dee")

From here you can use replaceSomeIn() to make the substitutions.

data.map("(\\d+):".r  //note: Map key is only part of the match pattern
                  .replaceSomeIn(_, m => num2name.get(m group 1)  //get replacement
                                                 .map(_ + ":")))  //restore ":"
//res0: Seq[String] = List(A:0.01
//                        ,Cy:0.03,Dee:0.04
//                        ,Bo:0.01,Cy:0.02
//                        ,5:0.02)

As you can see, "5:" is a match for the regex pattern but since the 5 part is not defined in num2name, the string is left unchanged.
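For readers working in Python, a rough counterpart to replaceSomeIn() is re.sub() with a function as the replacement argument. The sketch below is plain Python with no Spark dependency; the helper name replace_keys and the sample data are illustrative assumptions that mirror the Scala example, not part of any library.

```python
import re

# Hypothetical sample data and mapping, mirroring the Scala example.
data = ["1:0.01", "3:0.03,4:0.04", "2:0.01,3:0.02", "5:0.02"]
num2name = {"1": "A", "2": "Bo", "3": "Cy", "4": "Dee"}

# One or more digits followed by ":"; the digits are captured as group 1.
KEY_PATTERN = re.compile(r"(\d+):")

def replace_keys(text, mapping):
    """Swap each number before ':' for its mapped name; leave unmapped numbers alone."""
    def substitute(match):
        name = mapping.get(match.group(1))
        # Returning the original match text leaves unmapped keys unchanged.
        return match.group(0) if name is None else name + ":"
    return KEY_PATTERN.sub(substitute, text)

result = [replace_keys(s, num2name) for s in data]
print(result)
# ['A:0.01', 'Cy:0.03,Dee:0.04', 'Bo:0.01,Cy:0.02', '5:0.02']
```

Because the pattern captures the whole number, "12:" is looked up as "12" and is never partially rewritten by an entry for "2". In PySpark, a function like this could presumably be wrapped in a udf and applied to a string column.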

edited Aug 23, 2019 at 6:43

answered Aug 23, 2019 at 6:16

jwvh

1 Comment

Alan Over a year ago

Thanks for responding. Do you know how to do it in PySpark? What is the corresponding function for "replaceSomeIn" in PySpark?

Alan · Accepted Answer · 2023-08-13 07:48:56Z

1

This is the way I solved it in PySpark:

from collections import OrderedDict
from pyspark.sql.functions import col, udf

def _name_replacement(val, ordered_mapping):
    # Apply each replacement in the given order.
    for key, value in ordered_mapping.items():
        val = val.replace(key, value)
    return val

mapping = {"1:":"aaa:", "2:":"bbb:", ..., "24:":"xxx:", "25:":"yyy:", ....}
# Replace larger numbers first so that "2:" cannot rewrite the tail of "12:".
ordered_mapping = OrderedDict(sorted(mapping.items(), key=lambda t: int(t[0][:-1]), reverse=True))
replacing = udf(lambda x: _name_replacement(x, ordered_mapping))
new_df = df.withColumn("F", replacing(col("F")))
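The descending order in the code above matters because str.replace() matches anywhere in the string. A minimal sketch in plain Python (no Spark needed), using a hypothetical two-entry mapping, shows what goes wrong with ascending order and one case that descending order still does not cover:

```python
from collections import OrderedDict

def name_replacement(val, ordered_mapping):
    # Same loop as in the answer: apply each replacement in order.
    for key, value in ordered_mapping.items():
        val = val.replace(key, value)
    return val

# Hypothetical mapping chosen so that one key ("2:") is a suffix of another ("12:").
mapping = {"2:": "bbb:", "12:": "lll:"}

def sort_key(item):
    # Numeric value of the key without its trailing ":".
    return int(item[0][:-1])

ascending = OrderedDict(sorted(mapping.items(), key=sort_key))
descending = OrderedDict(sorted(mapping.items(), key=sort_key, reverse=True))

print(name_replacement("12:0.5", ascending))   # 1bbb:0.5  ("2:" fired inside "12:")
print(name_replacement("12:0.5", descending))  # lll:0.5   ("12:" handled first)
print(name_replacement("22:0.5", descending))  # 2bbb:0.5  (22 is unmapped, "2:" still fires)
```

The last line is the remaining gap: a number that is absent from the mapping can still be partially rewritten. If such numbers can occur in the data, matching the whole number with a regular expression such as (\d+): (the pattern used in the Scala answer) avoids the partial match.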

edited Aug 13, 2023 at 7:48

answered Aug 27, 2019 at 17:46

Alan
