
I want to parse a JSON request in PySpark and create multiple columns out of it, as follows:

{
  "ID": "abc123",
  "device": "mobile",
  "Ads": [
    {
      "placement": "topright",
      "Adlist": [
        {
          "name": "ad1",
          "subtype": "placeholder1",
          "category": "socialmedia",
        },
        {
          "name": "ad2",
          "subtype": "placeholder2",
          "category": "media",
        },
        {
          "name": "ad3",
          "subtype": "placeholder3",
          "category": "printingpress",
        }
      ]
    },
    {
      "Placement": "bottomleft",
      "Adlist": [
        {
          "name": "ad4",
          "subtype": "placeholder4",
          "category": "socialmedia",
        },
        {
          "name": "ad5",
          "subtype": "placeholder5",
          "category": "media",
        },
        {
          "name": "ad6",
          "subtype": "placeholder6",
          "category": "printingpress",
        }
      ]
    }
  ]
}

I tried the following:

df = spark.read.option("multiline", "true").json(json_file_location)
df_schema = df.schema
exploded_df = df.withColumn("data", from_json("data", df_schema)).select(col('data.*'))

But I am getting the error:

[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "from_json(data)" due to data type mismatch: Parameter 1 requires the "STRING" type, however "data" has the type "ARRAY<ARRAY<STRING>>".

The schema looks something like this, but is much larger and more deeply nested; this is just a smaller example:

root
 |-- Ads: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Adlist: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- category: string (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- subtype: string (nullable = true)
 |    |    |-- placement: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- device: string (nullable = true)

The output I am looking for is as follows:

ID     | device | placement  | name | subtype      | category      |
--------------------------------------------------------------------
abc123 | mobile | topright   | ad1  | placeholder1 | socialmedia   |
abc123 | mobile | topright   | ad2  | placeholder2 | media         |
abc123 | mobile | topright   | ad3  | placeholder3 | printingpress |
abc123 | mobile | bottomleft | ad4  | placeholder4 | socialmedia   |
abc123 | mobile | bottomleft | ad5  | placeholder5 | media         |
abc123 | mobile | bottomleft | ad6  | placeholder6 | printingpress |

Is there a way to achieve this using built-in PySpark SQL functions, without hard coding the schema? (The actual schema is much larger, more complex, and nested over multiple levels.) There are two parent columns:

  1. Data (array type) - multilevel nesting
  2. details (struct type) - multilevel nesting

I only need to flatten them using built-in PySpark functions, without specifying an explicit schema.

1 Answer

Hmm. At first blush, I wonder if it's the nested array in the JSON string?

Let's say your JSON string is in a variable named source_json_string. I'd stick the whole thing in a Spark dataframe:

from pyspark.sql.functions import *
from pyspark.sql.types import StringType

sdf_raw = spark.createDataFrame([source_json_string], StringType()).toDF("source_json_string")

sdf_raw.display()

Then I'd parse the JSON string.

sdf_parse_json_string = sdf_raw.select(
    from_json(col("source_json_string"), 
              schema="ID string, device string, Ads array<struct<placement:string, Adlist:array<struct<name:string, subtype:string, category:string>>>>")
    .alias("source_json_parsed")
)

sdf_parse_json_string.display()

Did you see what I did there to account for the arrays in your source data?
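
If spelling out that DDL string by hand gets unwieldy (your real schema is much bigger), one alternative is to reuse a schema Spark has already inferred. This is just a sketch, assuming the df you read with spark.read.option("multiline", "true").json(json_file_location) in the question is still available; from_json also accepts a StructType, so its inferred schema can stand in for the hand-written one:

from pyspark.sql.functions import col, from_json

# Sketch: reuse the schema Spark inferred for df (the multiline read from
# the question) instead of hand-writing a DDL string.
sdf_parse_json_string = sdf_raw.select(
    from_json(col("source_json_string"), df.schema)
    .alias("source_json_parsed")
)

sdf_parse_json_string.display()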

Then I'd tease out the individual fields from the nested JSON.

sdf_ad_placement_data = sdf_parse_json_string.select(
    col("source_json_parsed.ID").alias("ID"),
    col("source_json_parsed.device").alias("device"),
    explode(col("source_json_parsed.Ads")).alias("ads")
).select(
    col("ID"),
    col("device"),
    col("ads.placement").alias("placement"),
    explode(col("ads.Adlist")).alias("ad")
).select(
    col("ID"),
    col("device"),
    col("placement"),
    col("ad.name").alias("ad_name"),
    col("ad.subtype").alias("ad_subtype"),
    col("ad.category").alias("ad_category")
)

sdf_ad_placement_data.display()

Now, I'm coding against the sample payload, which has two objects in the Ads array and three ads apiece. If your real data is more dynamic, you'd need to account for that, but you can see how I transform the JSON into a dataframe for subsequent processing.
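
Also, since spark.read.json already inferred a schema for you in the question, another option (purely a sketch, assuming your original df from spark.read.option("multiline", "true").json(json_file_location) is still available) is to skip from_json entirely and explode the already-parsed columns, which avoids hard coding any schema:

from pyspark.sql.functions import col, explode

# Sketch: df is the dataframe from the question, with the schema Spark inferred.
flat_df = (
    df.select(col("ID"), col("device"), explode(col("Ads")).alias("ads"))
      .select(col("ID"), col("device"),
              col("ads.placement").alias("placement"),
              explode(col("ads.Adlist")).alias("ad"))
      .select("ID", "device", "placement", "ad.*")  # ad.* expands name, subtype, category
)

flat_df.show(truncate=False)

Each explode produces one output row per array element, so it handles however many placements and ads the payload actually contains.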
