
My code takes a string and extracts the elements within it to create a list.

Here is an example of a string:

'["A","B"]'

Here is the Python code:

import re

df[column + '_upd'] = df[column].apply(lambda x: re.findall(r'"(.*?)"', x.lower()))

This results in a list containing "a" and "b" (the .lower() call lowercases the matches).

I'm brand new to PySpark and am a bit lost on how to do this. I've seen people use regexp_extract, but that doesn't quite apply to this problem.

Any help would be much appreciated.

  • I don't understand. Is this a pandas frame or PySpark? Commented May 19, 2020 at 20:22
  • It's code in pandas that I need to transfer to PySpark. Commented May 19, 2020 at 20:23
  • Is the question basically "how do I convert a stringified list to a list"? If so, ast.literal_eval(s) is probably the best bet (see the sketch below). Commented May 19, 2020 at 21:40
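
For reference, the ast.literal_eval approach suggested in the last comment would look roughly like this on the pandas side (a sketch, assuming each cell holds a well-formed stringified list; the column names are illustrative):

import ast
import pandas as pd

# Each cell is a stringified list such as '["A","B"]';
# ast.literal_eval safely parses it into a real Python list.
df = pd.DataFrame({"col": ['["A","B"]']})
df["col_upd"] = df["col"].apply(ast.literal_eval)
print(df["col_upd"].iloc[0])  # ['A', 'B']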

1 Answer


You can use regexp_replace to strip the brackets, quotes, and spaces, then split on the commas.

from pyspark.sql import functions as F

# Remove [, ], ", and spaces, then split the remainder on commas.
df.withColumn("new_col", F.split(F.regexp_replace("col", r'[\[\]" ]', ''), ",")).show()

#+---------+-------+
#|      col|new_col|
#+---------+-------+
#|["A","B"]| [A, B]|
#+---------+-------+

#schema
#root
# |-- col: string (nullable = true)
# |-- new_col: array (nullable = true)
# |    |-- element: string (containsNull = true)
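
As a side note (not part of the original answer): if the column values are valid JSON arrays, as in the example, from_json with an ArrayType schema is a more robust alternative, since it also copes with commas or brackets inside the quoted elements. A minimal sketch, assuming a recent Spark version:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('["A","B"]',)], ["col"])  # same shape as the example above

# Parse each string as a JSON array of strings; malformed rows become null.
df.withColumn("new_col", F.from_json("col", ArrayType(StringType()))).show()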