
How to parse a JSON string of nested lists into a Spark data frame in PySpark?

Input data frame:

+-------------+-----------------------------------------------+
|url          |json                                           |
+-------------+-----------------------------------------------+
|https://url.a|[[1572393600000, 1.000],[1572480000000, 1.007]]|
|https://url.b|[[1572825600000, 1.002],[1572912000000, 1.000]]|
+-------------+-----------------------------------------------+

root
 |-- url: string (nullable = true)
 |-- json: string (nullable = true)

Expected output:

+------+---------------+-------+
|col_1 | col_2         | col_3 |
+------+---------------+-------+
| a    | 1572393600000 | 1.000 |
| a    | 1572480000000 | 1.007 |
| b    | 1572825600000 | 1.002 |
| b    | 1572912000000 | 1.000 |
+------+---------------+-------+

Example code:

import pyspark
import pyspark.sql.functions as F

spark = (pyspark.sql.SparkSession.builder.appName("Downloader_standalone")
    .master('local[*]')
    .getOrCreate())

sc = spark.sparkContext
from pyspark.sql import Row

rdd_list  = [('https://url.a','[[1572393600000, 1.000],[1572480000000, 1.007]]'),
             ('https://url.b','[[1572825600000, 1.002],[1572912000000, 1.000]]')]

jsons = sc.parallelize(rdd_list) 

df = spark.createDataFrame(jsons, "url string, json string")
df.show(truncate=False)
df.printSchema()


# This attempt does not work: "array<string,string>" is not a valid Spark schema string,
# and explode() of a plain array yields a single column, so three aliases cannot apply.
(df.withColumn('json', F.from_json(F.col('json'), "array<string,string>"))
 .select(F.explode('json').alias('col_1', 'col_2', 'col_3')).show())

There are a few examples out there, but I cannot figure out how to do it.

3 Answers


With some replacements in the strings and by splitting you can get the desired result:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "col_1",
    F.regexp_replace("url", "https://url.", "")
).withColumn(
    "col_2_3",
    F.explode(
        F.expr("""transform(
            split(trim(both '][' from json), '\\\],\\\['), 
            x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3)
        )""")
    )
).selectExpr("col_1", "col_2_3.*")

df1.show(truncate=False)

#+-----+-------------+------+
#|col_1|col_2        |col_3 |
#+-----+-------------+------+
#|a    |1572393600000| 1.000|
#|a    |1572480000000| 1.007|
#|b    |1572825600000| 1.002|
#|b    |1572912000000| 1.000|
#+-----+-------------+------+

Explanation:

  1. trim(both '][' from json): removes the leading and trailing [ and ] characters, giving something like: 1572393600000, 1.000],[1572480000000, 1.007

  2. Now you can split by ],[ (the \\\ is there to escape the brackets in the regex)

  3. transform takes the array from the split and, for each element, splits it by comma and creates a struct with fields col_2 and col_3

  4. explode the array of structs you get from transform and star-expand the struct column (a step-by-step sketch of these stages follows below)
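
If it helps to see what each stage produces, here is a small sketch (reusing df from the question) that keeps every intermediate step as its own column; the step_* names are just illustrative:

import pyspark.sql.functions as F

# Materialize each stage of the parse as a separate column for inspection
(df
 .withColumn("step_trim", F.expr("trim(both '][' from json)"))
 .withColumn("step_split", F.expr("""split(step_trim, '\\\],\\\[')"""))
 .withColumn("step_structs", F.expr("""transform(
     step_split,
     x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3))"""))
 .select("url", "step_trim", "step_split", "step_structs")
 .show(truncate=False))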


4 Comments

Thanks for your answer. For some reason I get an exception, any idea how to solve it? AnalysisException: Can only star expand struct data types. Attribute: `ArrayBuffer(json)`;
That works well. Could you also explain what F.expr("""transform does, and what the "'\\\" is for, etc.?
One last thing, if you could help: if there is a URL like "url&id=ID1234]2]0]" and I would like to extract "ID1234" (the id always has 6 characters), any idea how to do it?
@Dan you can use the regexp_extract function: F.regexp_extract(F.col("url"), ".+(ID\\d{4}).+", 1)
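
A quick sketch of that last comment's regexp_extract suggestion, using the sample URL from the comment (the one-row DataFrame below is only for illustration):

import pyspark.sql.functions as F

# Hypothetical one-row DataFrame holding the URL from the comment
df_ids = spark.createDataFrame([("url&id=ID1234]2]0]",)], "url string")

# Capture "ID" followed by 4 digits (6 characters in total), as suggested in the comment
df_ids.select(F.regexp_extract(F.col("url"), ".+(ID\\d{4}).+", 1).alias("id")).show()
# returns ID1234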
An alternative way to parse the json column is to apply from_json twice, first on the outer array and then on each inner pair:

(df.select(df.url, F.explode(F.from_json(df.json, "array<string>")))        # one row per inner pair, still a string
   .select("url", F.from_json(F.col("col"), "array<string>").alias("col"))  # parse each pair into an array
   .select("url", F.col("col").getItem(0), F.col("col").getItem(1))
   .show(truncate=False))

+-------------+-------------+------+
|url          |col[0]       |col[1]|
+-------------+-------------+------+
|https://url.a|1572393600000|1.0   |
|https://url.a|1572480000000|1.007 |
|https://url.b|1572825600000|1.002 |
|https://url.b|1572912000000|1.0   |
+-------------+-------------+------+
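
To also get the short url label and the numeric types from the expected output in the question, the result of the two from_json calls can be renamed and cast afterwards (a sketch building on the snippet above):

import pyspark.sql.functions as F

parsed = (df.select(df.url, F.explode(F.from_json(df.json, "array<string>")))
            .select("url", F.from_json(F.col("col"), "array<string>").alias("col")))

# Strip the URL prefix and cast the string items to numeric types
(parsed.select(
     F.regexp_replace("url", "https://url.", "").alias("col_1"),
     F.col("col").getItem(0).cast("long").alias("col_2"),
     F.col("col").getItem(1).cast("double").alias("col_3"))
 .show(truncate=False))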

1 Comment

Thanks, in this case there are always only 2 values in each pair. This works well!
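
Since the inner arrays always hold exactly two values, another variation on the snippet above is a single from_json call with a nested array schema (a sketch, assuming Spark 2.4+):

import pyspark.sql.functions as F

# Parse the whole string in one pass as an array of [timestamp, value] pairs
(df.select("url", F.explode(F.from_json("json", "array<array<double>>")).alias("pair"))
   .select("url",
           F.col("pair").getItem(0).cast("long").alias("col_2"),
           F.col("pair").getItem(1).alias("col_3"))
   .show(truncate=False))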
df.select(col("url"), 
          explode(from_json(col("json"), ArrayType(StringType))))      
  .select(col("url"), 
          from_json(col("col"), ArrayType(StringType)).alias("col"))      
  .select(col("url"),
          col("col").getItem(0).alias("code"),        
          col("col").getItem(1).alias("value"))

