I want to parse a JSON request in PySpark and create multiple columns out of it, as follows:
{
    "ID": "abc123",
    "device": "mobile",
    "Ads": [
        {
            "placement": "topright",
            "Adlist": [
                {
                    "name": "ad1",
                    "subtype": "placeholder1",
                    "category": "socialmedia"
                },
                {
                    "name": "ad2",
                    "subtype": "placeholder2",
                    "category": "media"
                },
                {
                    "name": "ad3",
                    "subtype": "placeholder3",
                    "category": "printingpress"
                }
            ]
        },
        {
            "placement": "bottomleft",
            "Adlist": [
                {
                    "name": "ad4",
                    "subtype": "placeholder4",
                    "category": "socialmedia"
                },
                {
                    "name": "ad5",
                    "subtype": "placeholder5",
                    "category": "media"
                },
                {
                    "name": "ad6",
                    "subtype": "placeholder6",
                    "category": "printingpress"
                }
            ]
        }
    ]
}
I tried the following:
df = spark.read.option("multiline", "true").json(json_file_location)
df_schema = df.schema
exploded_df = df.withColumn("data", from_json("data", df_schema)).select(col('data.*'))
But I am getting this error:
[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "from_json(data)" due to data type mismatch: parameter 1 requires the "STRING" type, however "data" has the type "ARRAY&lt;ARRAY&lt;STRING&gt;&gt;".
The schema looks similar to the following, but the actual one is much larger and more deeply nested; this is just a smaller example:
root
|-- Ads: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Adlist: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- category: string (nullable = true)
| | | | |-- name: string (nullable = true)
| | | | |-- subtype: string (nullable = true)
| | |-- placement: string (nullable = true)
|-- ID: string (nullable = true)
|-- device: string (nullable = true)
The output I am looking for is as follows:
ID    | device | placement | name | subtype      | category      |
-------------------------------------------------------------------
abc123| mobile | topright | ad1 | placeholder1 | socialmedia |
abc123| mobile | topright | ad2 | placeholder2 | media |
abc123| mobile | topright | ad3 | placeholder3 | printingpress |
abc123| mobile |bottomleft | ad4 | placeholder4 | socialmedia |
abc123| mobile |bottomleft | ad5 | placeholder5 | media |
abc123| mobile |bottomleft | ad6 | placeholder6 | printingpress |
Is there a way to achieve this using built-in PySpark SQL functions, without hard-coding the schema? (The actual schema is much larger, more complex, and nested multiple levels deep.) There are two parent columns:
- Data (array type) - multilevel nesting
- details (struct type) - multilevel nesting

I only need to flatten it using built-in PySpark functions, without specifying an explicit schema.