2

How to replace substrings of a string. For example, I created a data frame based on the following json format.

line1:{"F":{"P3":"1:0.01","P8":"3:0.03,4:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"2:0.01,3:0.02","P10":"5:0.02", ...},"I":"blah"}

I need to replace the substrings "1:", "2:", "3:", with "a:", "b:", "c:", and etc. So the result will be:

line1:{"F":{"P3":"a:0.01","P8":"c:0.03,d:0.04", ...},"I":"blah"}
line2:{"F":{"P4":"b:0.01,c:0.02","P10":"e:0.02", ...},"I":"blah"}

Please consider that this is just an example the real replacement is substring replacement not character replacement.

Any guidance either in Scala or Pyspark is helpful.

5
  • Oh, sorry I think my explanation is confusing. I only want to replace the numbers in the string after ":". Basically, P1, P2, ... Pn are keys and I don't want to replace the keys or change their names. I only want to replace the strings in the values ==> "1:" to "a:", "2:" to "b:" and so on. Commented Aug 22, 2019 at 23:56
  • 1: to a:, and 2: to b:, and 27: to ...? Commented Aug 23, 2019 at 0:08
  • yes, something like that. Commented Aug 23, 2019 at 0:10
  • Like what? How is 27 translated? How is 32521 translated? Commented Aug 23, 2019 at 0:13
  • so the whole string before ":" is replaced with a new string. "1:" to "hello_word:", "2:" to "another_hello_word",... "27:" to "how_are_you:", "50:" to "how_am_I". Let's say you have a dictionary (map) that maps numbers to a string, the size of the map can change and it is not necessary 27... and I want to replace the number (as key in the dictionary) with it's value that can be one of those examples that I put. So it is not necessarily 27 numbers but it can be higher numbers, where it can change during time... I mean the map can change as well as its size. Commented Aug 23, 2019 at 0:18

3 Answers 3

3
from pyspark.sql.functions import *       
newDf = df.withColumn('col_name', regexp_replace('col_name', '1:', 'a:'))

Details here: Pyspark replace strings in Spark dataframe column

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks for the guidance, I am looking for something that does all the replacements at once. For example based on a map (dictionary) replacing all the keys with the values ==> as "1:" to "a:", "2:" to "b:", and so on.
This solution is partial, the one in the link will work properly. The only problem is that it doesn't solve the question fully. As I my question show the json file after conversion to dataframe will have two columns as "I" and "F" where "F" datatype is a struct<string, string,....> and when I try to use the solution in the shared link will have error due to datatype mismatch as it looks for string but column "F" datatype is not string.
1

Let's say you have a collection of strings for possible modification (simplified for this example).

val data = Seq("1:0.01"
              ,"3:0.03,4:0.04"
              ,"2:0.01,3:0.02"
              ,"5:0.02")

And you have a dictionary of required conversions.

val num2name = Map("1" -> "A"
                  ,"2" -> "Bo"
                  ,"3" -> "Cy"
                  ,"4" -> "Dee")

From here you can use replaceSomeIn() to make the substitutions.

data.map("(\\d+):".r  //note: Map key is only part of the match pattern
                  .replaceSomeIn(_, m => num2name.get(m group 1)  //get replacement
                                                 .map(_ + ":")))  //restore ":"
//res0: Seq[String] = List(A:0.01
//                        ,Cy:0.03,Dee:0.04
//                        ,Bo:0.01,Cy:0.02
//                        ,5:0.02)

As you can see, "5:" is a match for the regex pattern but since the 5 part is not defined in num2name, the string is left unchanged.

1 Comment

Thanks for your responding. Do you know how to do it in pyspark? what is the corresponding function for "replaceSomeIn" in pyspark?
1

This is the way I solved it in PySpark:

def _name_replacement(val, ordered_mapping):
    for key, value in ordered_mapping.items():
        val = val.replace(key, value)
    return val

mapping = {"1:":"aaa:", "2:":"bbb:", ..., "24:":"xxx:", "25:":"yyy:", ....}
ordered_mapping = OrderedDict(reversed(sorted(mapping.items(), key=lambda t: int(t[0][:-1]))))
replacing = udf(lambda x: _name_replacement(x, ordered_mapping))
new_df = df.withColumn("F", replacing(col("F")))

   

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.