I'm using the code below to read data from an API whose payload is in JSON format, using PySpark in Azure Databricks. All the fields are defined as strings, but I keep running into a "json_tuple requires that all arguments are strings" error.
Schema:
root
|-- Payload: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ActiveDate: string (nullable = true)
| | |-- BusinessId: string (nullable = true)
| | |-- BusinessName: string (nullable = true)
JSON:
{
"Payload":
[
{
"ActiveDate": "2008-11-25",
"BusinessId": "5678",
"BusinessName": "ACL"
},
{
"ActiveDate": "2009-03-22",
"BusinessId": "6789",
"BusinessName": "BCL"
}
]
}
PySpark:
from pyspark.sql import functions as F
df = df.select(F.col('Payload'), F.json_tuple(F.col('Payload'), 'ActiveDate', 'BusinessId', 'BusinessName') \.alias('ActiveDate', 'BusinessId', 'BusinessName'))
df.write.format("delta").mode("overwrite").saveAsTable("delta_payload")
Error:
AnalysisException: cannot resolve 'json_tuple(`Payload`, 'ActiveDate', 'BusinessId', 'BusinessName')' due to data type mismatch: json_tuple requires that all arguments are strings;