
How to parse a JSON string of nested lists into a Spark data frame in PySpark?

Input data frame:

+-------------+-----------------------------------------------+
|url          |json                                           |
+-------------+-----------------------------------------------+
|https://url.a|[[1572393600000, 1.000],[1572480000000, 1.007]]|
|https://url.b|[[1572825600000, 1.002],[1572912000000, 1.000]]|
+-------------+-----------------------------------------------+

root
 |-- url: string (nullable = true)
 |-- json: string (nullable = true)

Expected output:

+------+---------------+-------+
|col_1 | col_2         | col_3 |
+------+---------------+-------+
| a    | 1572393600000 | 1.000 |
| a    | 1572480000000 | 1.007 |
| b    | 1572825600000 | 1.002 |
| b    | 1572912000000 | 1.000 |
+------+---------------+-------+

Example code:

import pyspark
import pyspark.sql.functions as F

spark = (pyspark.sql.SparkSession.builder.appName("Downloader_standalone")
    .master('local[*]')
    .getOrCreate())

sc = spark.sparkContext
from pyspark.sql import Row

rdd_list  = [('https://url.a','[[1572393600000, 1.000],[1572480000000, 1.007]]'),
             ('https://url.b','[[1572825600000, 1.002],[1572912000000, 1.000]]')]

jsons = sc.parallelize(rdd_list) 

df = spark.createDataFrame(jsons, "url string, json string")
df.show(truncate=False)
df.printSchema()


# This attempt does not work: "array<string,string>" is not a valid Spark schema string,
# and explode() of a plain array yields a single column, so three aliases cannot apply.
(df.withColumn('json', F.from_json(F.col('json'), "array<string,string>"))
 .select(F.explode('json').alias('col_1', 'col_2', 'col_3')).show())

There are a few examples out there, but I cannot figure out how to do it.

3 Answers


With some replacements in the strings and by splitting you can get the desired result:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "col_1",
    F.regexp_replace("url", "https://url.", "")
).withColumn(
    "col_2_3",
    F.explode(
        F.expr("""transform(
            split(trim(both '][' from json), '\\\],\\\['), 
            x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3)
        )""")
    )
).selectExpr("col_1", "col_2_3.*")

df1.show(truncate=False)

#+-----+-------------+------+
#|col_1|col_2        |col_3 |
#+-----+-------------+------+
#|a    |1572393600000| 1.000|
#|a    |1572480000000| 1.007|
#|b    |1572825600000| 1.002|
#|b    |1572912000000| 1.000|
#+-----+-------------+------+

Explanation:

  1. trim(both '][' from json): removes the leading and trailing [ and ] characters, giving something like: 1572393600000, 1.000],[1572480000000, 1.007

  2. Now you can split by ],[ (the \\\ is there to escape the brackets in the regex)

  3. transform takes the array from the split and, for each element, splits it by comma and creates a struct with fields col_2 and col_3

  4. explode the array of structs you get from transform and star-expand the struct column (a step-by-step sketch of these stages follows below)
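
If it helps to see what each stage produces, here is a small sketch (reusing df from the question) that keeps every intermediate step as its own column; the step_* names are just illustrative:

import pyspark.sql.functions as F

# Materialize each stage of the parse as a separate column for inspection
(df
 .withColumn("step_trim", F.expr("trim(both '][' from json)"))
 .withColumn("step_split", F.expr("""split(step_trim, '\\\],\\\[')"""))
 .withColumn("step_structs", F.expr("""transform(
     step_split,
     x -> struct(split(x, ',')[0] as col_2, split(x, ',')[1] as col_3))"""))
 .select("url", "step_trim", "step_split", "step_structs")
 .show(truncate=False))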


4 Comments

Thanks for your answer. For some reason I get an exception, any idea how to solve it? AnalysisException: Can only star expand struct data types. Attribute: `ArrayBuffer(json)`;
That works well. Could you also explain what F.expr("""transform does, and what the "'\\\" is for, etc.?
One last thing, if you could help: if there is a URL like "url&id=ID1234]2]0]" and I would like to extract "ID1234" (the id always has 6 characters), any idea how to do it?
@Dan you can use the regexp_extract function: F.regexp_extract(F.col("url"), ".+(ID\\d{4}).+", 1)
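
A quick sketch of that last comment's regexp_extract suggestion, using the sample URL from the comment (the one-row DataFrame below is only for illustration):

import pyspark.sql.functions as F

# Hypothetical one-row DataFrame holding the URL from the comment
df_ids = spark.createDataFrame([("url&id=ID1234]2]0]",)], "url string")

# Capture "ID" followed by 4 digits (6 characters in total), as suggested in the comment
df_ids.select(F.regexp_extract(F.col("url"), ".+(ID\\d{4}).+", 1).alias("id")).show()
# returns ID1234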
An alternative way to parse the json column is to apply from_json twice, first on the outer array and then on each inner pair:

(df.select(df.url, F.explode(F.from_json(df.json, "array<string>")))        # one row per inner pair, still a string
   .select("url", F.from_json(F.col("col"), "array<string>").alias("col"))  # parse each pair into an array
   .select("url", F.col("col").getItem(0), F.col("col").getItem(1))
   .show(truncate=False))

+-------------+-------------+------+
|url          |col[0]       |col[1]|
+-------------+-------------+------+
|https://url.a|1572393600000|1.0   |
|https://url.a|1572480000000|1.007 |
|https://url.b|1572825600000|1.002 |
|https://url.b|1572912000000|1.0   |
+-------------+-------------+------+
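
To also get the short url label and the numeric types from the expected output in the question, the result of the two from_json calls can be renamed and cast afterwards (a sketch building on the snippet above):

import pyspark.sql.functions as F

parsed = (df.select(df.url, F.explode(F.from_json(df.json, "array<string>")))
            .select("url", F.from_json(F.col("col"), "array<string>").alias("col")))

# Strip the URL prefix and cast the string items to numeric types
(parsed.select(
     F.regexp_replace("url", "https://url.", "").alias("col_1"),
     F.col("col").getItem(0).cast("long").alias("col_2"),
     F.col("col").getItem(1).cast("double").alias("col_3"))
 .show(truncate=False))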

1 Comment

Thanks, in this case there are always only 2 values in each pair. This works well!
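
Since the inner arrays always hold exactly two values, another variation on the snippet above is a single from_json call with a nested array schema (a sketch, assuming Spark 2.4+):

import pyspark.sql.functions as F

# Parse the whole string in one pass as an array of [timestamp, value] pairs
(df.select("url", F.explode(F.from_json("json", "array<array<double>>")).alias("pair"))
   .select("url",
           F.col("pair").getItem(0).cast("long").alias("col_2"),
           F.col("pair").getItem(1).alias("col_3"))
   .show(truncate=False))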
df.select(col("url"), 
          explode(from_json(col("json"), ArrayType(StringType))))      
  .select(col("url"), 
          from_json(col("col"), ArrayType(StringType)).alias("col"))      
  .select(col("url"),
          col("col").getItem(0).alias("code"),        
          col("col").getItem(1).alias("value"))

