
I have a dataset that contains some nested PySpark rows stored as strings. When I read the dataset into PySpark, one of the columns comes in as a string that looks something like this:

"Row(name='Bob', updated='Sat Nov 21 12:57:54', isProgrammer=True)"

My goal is to parse some of these subfields into separate columns, but I am having trouble reading them in.

df.select(col('user')['name'].alias('name'))

is the syntax I am trying, but it doesn't seem to be working. It gives me this error:

Can't extract value from user#11354: need struct type but got string

Is there an easy way to read this type of data?

1 Answer

Considering you can't change the input you get, the code below runs a UDF that evals each Row string and collects all of its fields into an array of strings. You can tinker with the UDF to make it return a MapType or a StructType instead (a StructType sketch is shown further down).

I would highly recommend moving away from this input format, though.

from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

@udf(returnType=ArrayType(StringType()))
def custom_parser_udf(unparsed_row):
    # Imported inside the UDF so Row is in scope when eval runs on the executors
    from pyspark.sql import Row

    # Evaluate the string back into a Row object, then pull out its fields.
    # Note: eval executes arbitrary code, so only use this on trusted data.
    as_row = eval(unparsed_row)
    return [str(as_row.name), str(as_row.updated), str(as_row.isProgrammer)]

You can then apply it to the DataFrame to get whatever you need:

result = df.withColumn("parsed_row", custom_parser_udf("unparsed_col"))
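
Since the goal was separate columns, you can then index into the array with getItem. A minimal sketch, assuming the field order from the UDF above (the column names here are illustrative):

from pyspark.sql.functions import col

result = (
    df.withColumn("parsed_row", custom_parser_udf("unparsed_col"))
      .withColumn("name", col("parsed_row").getItem(0))
      .withColumn("updated", col("parsed_row").getItem(1))
      .withColumn("isProgrammer", col("parsed_row").getItem(2))
)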

An alternative is to write some sort of parser based on the split function, which I also can't recommend.
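
For completeness, here is a sketch of the StructType variant mentioned above. The schema is an assumption based on the example Row in the question; with a struct column, your original col('user')['name'] style of access then works:

from pyspark.sql.types import StructType, StructField, StringType, BooleanType
from pyspark.sql.functions import udf, col

# Assumed schema, inferred from the example Row string
row_schema = StructType([
    StructField("name", StringType()),
    StructField("updated", StringType()),
    StructField("isProgrammer", BooleanType()),
])

@udf(returnType=row_schema)
def struct_parser_udf(unparsed_row):
    # Imported inside the UDF so Row is in scope on the executors
    from pyspark.sql import Row

    as_row = eval(unparsed_row)
    # Return a dict matching the schema; PySpark maps it onto the struct fields
    return {"name": as_row.name,
            "updated": as_row.updated,
            "isProgrammer": bool(as_row.isProgrammer)}

parsed = df.withColumn("user", struct_parser_udf("unparsed_col"))
parsed.select(col("user")["name"].alias("name"))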
