
I have a dataset that contains some nested PySpark rows stored as strings. When I read the dataset into PySpark, one of the columns comes in as a string that looks something like this:

"Row(name='Bob', updated='Sat Nov 21 12:57:54', isProgrammer=True)"

My goal is to parse some of these subfields into separate columns, but I am having trouble reading them in.

df.select(col('user')['name'].alias('name'))

is the syntax I am trying, but it doesn't seem to be working. It gives me this error:

Can't extract value from user#11354: need struct type but got string

Is there an easy way to read this type of data?

1 Answer

Considering you can't change the input you get, the code below runs a UDF that evals each Row string and collects all of its fields into an array of strings. You can tinker with the UDF to make it return a MapType or a StructType instead (a StructType sketch is shown further down).

I would highly recommend moving away from this input format, though.

from pyspark.sql.types import ArrayType, StringType
from pyspark.sql.functions import udf

@udf(returnType=ArrayType(StringType()))
def custom_parser_udf(unparsed_row):
    # Imported inside the UDF so Row is in scope when eval runs on the executors
    from pyspark.sql import Row

    # Evaluate the string back into a Row object, then pull out its fields.
    # Note: eval executes arbitrary code, so only use this on trusted data.
    as_row = eval(unparsed_row)
    return [str(as_row.name), str(as_row.updated), str(as_row.isProgrammer)]

You can then apply it to the DataFrame to get whatever you need:

result = df.withColumn("parsed_row", custom_parser_udf("unparsed_col"))
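
Since the goal was separate columns, you can then index into the array with getItem. A minimal sketch, assuming the field order from the UDF above (the column names here are illustrative):

from pyspark.sql.functions import col

result = (
    df.withColumn("parsed_row", custom_parser_udf("unparsed_col"))
      .withColumn("name", col("parsed_row").getItem(0))
      .withColumn("updated", col("parsed_row").getItem(1))
      .withColumn("isProgrammer", col("parsed_row").getItem(2))
)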

An alternative is to write some sort of parser based on the split function, which I also can't recommend.
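
For completeness, here is a sketch of the StructType variant mentioned above. The schema is an assumption based on the example Row in the question; with a struct column, your original col('user')['name'] style of access then works:

from pyspark.sql.types import StructType, StructField, StringType, BooleanType
from pyspark.sql.functions import udf, col

# Assumed schema, inferred from the example Row string
row_schema = StructType([
    StructField("name", StringType()),
    StructField("updated", StringType()),
    StructField("isProgrammer", BooleanType()),
])

@udf(returnType=row_schema)
def struct_parser_udf(unparsed_row):
    # Imported inside the UDF so Row is in scope on the executors
    from pyspark.sql import Row

    as_row = eval(unparsed_row)
    # Return a dict matching the schema; PySpark maps it onto the struct fields
    return {"name": as_row.name,
            "updated": as_row.updated,
            "isProgrammer": bool(as_row.isProgrammer)}

parsed = df.withColumn("user", struct_parser_udf("unparsed_col"))
parsed.select(col("user")["name"].alias("name"))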
