
My code takes a string and extracts the elements within it to create a list.

Here is an example of a string:

'["A","B"]'

Here is the Python code:

import re

df[column + '_upd'] = df[column].apply(lambda x: re.findall(r'"(.*?)"', x.lower()))

This results in a list containing "a" and "b" (the .lower() call lowercases the matches).

I'm brand new to PySpark and am a bit lost on how to do this. I've seen people use regexp_extract, but that doesn't quite apply to this problem.

Any help would be much appreciated.

  • I don't understand. Is this a pandas frame or PySpark? Commented May 19, 2020 at 20:22
  • It's code in pandas that I need to transfer to PySpark. Commented May 19, 2020 at 20:23
  • Is the question basically "how do I convert a stringified list to a list"? If so, ast.literal_eval(s) is probably the best bet (see the sketch below). Commented May 19, 2020 at 21:40
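
For reference, the ast.literal_eval approach suggested in the last comment would look roughly like this on the pandas side (a sketch, assuming each cell holds a well-formed stringified list; the column names are illustrative):

import ast
import pandas as pd

# Each cell is a stringified list such as '["A","B"]';
# ast.literal_eval safely parses it into a real Python list.
df = pd.DataFrame({"col": ['["A","B"]']})
df["col_upd"] = df["col"].apply(ast.literal_eval)
print(df["col_upd"].iloc[0])  # ['A', 'B']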

1 Answer


You can use regexp_replace to strip the brackets, quotes, and spaces, then split on the commas.

from pyspark.sql import functions as F

# Remove [, ], ", and spaces, then split the remainder on commas.
df.withColumn("new_col", F.split(F.regexp_replace("col", r'[\[\]" ]', ''), ",")).show()

#+---------+-------+
#|      col|new_col|
#+---------+-------+
#|["A","B"]| [A, B]|
#+---------+-------+

#schema
#root
# |-- col: string (nullable = true)
# |-- new_col: array (nullable = true)
# |    |-- element: string (containsNull = true)
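
As a side note (not part of the original answer): if the column values are valid JSON arrays, as in the example, from_json with an ArrayType schema is a more robust alternative, since it also copes with commas or brackets inside the quoted elements. A minimal sketch, assuming a recent Spark version:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('["A","B"]',)], ["col"])  # same shape as the example above

# Parse each string as a JSON array of strings; malformed rows become null.
df.withColumn("new_col", F.from_json("col", ArrayType(StringType()))).show()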