
I want to parse a JSON request in PySpark and create multiple columns out of it, as follows:

{
  "ID": "abc123",
  "device": "mobile",
  "Ads": [
    {
      "placement": "topright",
      "Adlist": [
        {
          "name": "ad1",
          "subtype": "placeholder1",
          "category": "socialmedia",
        },
        {
          "name": "ad2",
          "subtype": "placeholder2",
          "category": "media",
        },
        {
          "name": "ad3",
          "subtype": "placeholder3",
          "category": "printingpress",
        }
      ]
    },
    {
      "Placement": "bottomleft",
      "Adlist": [
        {
          "name": "ad4",
          "subtype": "placeholder4",
          "category": "socialmedia",
        },
        {
          "name": "ad5",
          "subtype": "placeholder5",
          "category": "media",
        },
        {
          "name": "ad6",
          "subtype": "placeholder6",
          "category": "printingpress",
        }
      ]
    }
  ]
}

I tried the following:

df = spark.read.option("multiline", "true").json(json_file_location)
df_schema = df.schema
exploded_df = df.withColumn("data", from_json("data", df_schema)).select(col('data.*'))

But I am getting the error:

[DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE] Cannot resolve "from_json(data)" due to data type mismatch: Parameter 1 requires the "STRING" type, however "data" has the type "ARRAY<ARRAY<STRING>>".

The schema looks something like this, but is much larger and more deeply nested; this is just a smaller example:

root
 |-- Ads: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Adlist: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- category: string (nullable = true)
 |    |    |    |    |-- name: string (nullable = true)
 |    |    |    |    |-- subtype: string (nullable = true)
 |    |    |-- placement: string (nullable = true)
 |-- ID: string (nullable = true)
 |-- device: string (nullable = true)

The output I am looking for is as follows:

ID     | device | placement  | name | subtype      | category      |
--------------------------------------------------------------------
abc123 | mobile | topright   | ad1  | placeholder1 | socialmedia   |
abc123 | mobile | topright   | ad2  | placeholder2 | media         |
abc123 | mobile | topright   | ad3  | placeholder3 | printingpress |
abc123 | mobile | bottomleft | ad4  | placeholder4 | socialmedia   |
abc123 | mobile | bottomleft | ad5  | placeholder5 | media         |
abc123 | mobile | bottomleft | ad6  | placeholder6 | printingpress |

Is there a way to achieve this using built-in PySpark SQL functions, without hard coding the schema? (The actual schema is much larger, more complex, and nested over multiple levels.) There are two parent columns:

  1. Data (array type) - multilevel nesting
  2. details (struct type) - multilevel nesting

I only need to flatten them using built-in PySpark functions, without specifying an explicit schema.

1 Answer

Hmm. At first blush, I wonder if it's the nested array in the JSON string?

Let's say your JSON string is in a variable named source_json_string. I'd stick the whole thing in a Spark dataframe:

from pyspark.sql.functions import *
from pyspark.sql.types import StringType

sdf_raw = spark.createDataFrame([source_json_string], StringType()).toDF("source_json_string")

sdf_raw.display()

Then I'd parse the JSON string.

sdf_parse_json_string = sdf_raw.select(
    from_json(col("source_json_string"), 
              schema="ID string, device string, Ads array<struct<placement:string, Adlist:array<struct<name:string, subtype:string, category:string>>>>")
    .alias("source_json_parsed")
)

sdf_parse_json_string.display()

Did you see what I did there to account for the arrays in your source data?
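
If spelling out that DDL string by hand gets unwieldy (your real schema is much bigger), one alternative is to reuse a schema Spark has already inferred. This is just a sketch, assuming the df you read with spark.read.option("multiline", "true").json(json_file_location) in the question is still available; from_json also accepts a StructType, so its inferred schema can stand in for the hand-written one:

from pyspark.sql.functions import col, from_json

# Sketch: reuse the schema Spark inferred for df (the multiline read from
# the question) instead of hand-writing a DDL string.
sdf_parse_json_string = sdf_raw.select(
    from_json(col("source_json_string"), df.schema)
    .alias("source_json_parsed")
)

sdf_parse_json_string.display()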

Then I'd tease out the individual fields from the nested JSON.

sdf_ad_placement_data = sdf_parse_json_string.select(
    col("source_json_parsed.ID").alias("ID"),
    col("source_json_parsed.device").alias("device"),
    explode(col("source_json_parsed.Ads")).alias("ads")
).select(
    col("ID"),
    col("device"),
    col("ads.placement").alias("placement"),
    explode(col("ads.Adlist")).alias("ad")
).select(
    col("ID"),
    col("device"),
    col("placement"),
    col("ad.name").alias("ad_name"),
    col("ad.subtype").alias("ad_subtype"),
    col("ad.category").alias("ad_category")
)

sdf_ad_placement_data.display()

Now, I'm coding against the sample payload, which has two objects in the Ads array and three ads apiece. If your real data is more dynamic, you'd need to account for that, but you can see how I transform the JSON into a dataframe for subsequent processing.
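
Also, since spark.read.json already inferred a schema for you in the question, another option (purely a sketch, assuming your original df from spark.read.option("multiline", "true").json(json_file_location) is still available) is to skip from_json entirely and explode the already-parsed columns, which avoids hard coding any schema:

from pyspark.sql.functions import col, explode

# Sketch: df is the dataframe from the question, with the schema Spark inferred.
flat_df = (
    df.select(col("ID"), col("device"), explode(col("Ads")).alias("ads"))
      .select(col("ID"), col("device"),
              col("ads.placement").alias("placement"),
              explode(col("ads.Adlist")).alias("ad"))
      .select("ID", "device", "placement", "ad.*")  # ad.* expands name, subtype, category
)

flat_df.show(truncate=False)

Each explode produces one output row per array element, so it handles however many placements and ads the payload actually contains.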
