
I have a PySpark DataFrame with a string column containing the value [{"AppId":"APACON","ExtId":"141730"}] (the column holds exactly that string; it is a string, not an array).

I want to convert this to an array of struct.

Can I do that simply with a native Spark function, or do I have to parse the string or use a UDF?

# Reproduction: the JSON array lives in the txt column as a plain string
sqlContext.createDataFrame(
    [ (1,'[{"AppId":"APACON","ExtId":"141730"}]'),
      (2,'[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx','txt']
).show()

+---+--------------------+
|idx|                 txt|
+---+--------------------+
|  1|[{"AppId":"APACON...|
|  2|[{"AppId":"APACON...|
+---+--------------------+
  • You need from_json. Commented Jun 13, 2018 at 13:40
  • Not available in 2.0... any alternative? Commented Jun 13, 2018 at 13:46
  • You could try pyspark.sql.functions.get_json_object, which will parse the txt column and create one column per field with the associated values. Commented Jun 13, 2018 at 13:57
  • @plalanne it parses the JSON, but I do not know how I can access the elements of the list. Commented Jun 13, 2018 at 14:15
  • I edited my answer and proposed a regexp to ignore the square brackets in df.txt as a first step. Let me know if this helps. Commented Jun 13, 2018 at 14:36

1 Answer


With Spark 2.1 or above

You have the following data:

import pyspark.sql.functions as F
from pyspark.sql.types import *

df = sqlContext.createDataFrame(
    [ (1,'[{"AppId":"APACON","ExtId":"141730"}]'),
      (2,'[{"AppId":"APACON","ExtId":"141793"}]'),
    ],
    ['idx','txt']
)

You can indeed use pyspark.sql.functions.from_json as follows:

schema = StructType([StructField("AppId", StringType()),
                     StructField("ExtId", StringType())])

# from_json with a StructType schema returns a single struct column;
# as the output below shows, it picks up the object inside the JSON array
df = df.withColumn('array', F.from_json(F.col('txt'), schema))
df.show()

+---+--------------------+---------------+
|idx|                 txt|          array|
+---+--------------------+---------------+
|  1|[{"AppId":"APACON...|[APACON,141730]|
|  2|[{"AppId":"APACON...|[APACON,141793]|
+---+--------------------+---------------+
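
Note that this gives you a single struct, not an array of structs. On later releases (Spark 2.2 and above, if I remember correctly), from_json also accepts an ArrayType schema, which returns a genuine array of structs. A minimal sketch, reusing the imports and df from above:

array_schema = ArrayType(StructType([
    StructField("AppId", StringType()),
    StructField("ExtId", StringType())
]))

df2 = df.withColumn('parsed', F.from_json(F.col('txt'), array_schema))
# Elements of the array can then be accessed by index
df2.select('idx', df2.parsed[0]['ExtId'].alias('ExtId')).show()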


Version < Spark 2.1

One way to work around the issue is to first slightly modify your input string, stripping the square brackets:

# Use regexp_extract to strip the square brackets
df = df.withColumn('txt_parsed', F.regexp_extract(F.col('txt'), '[^\\[\\]]+', 0))
df.show(truncate=False)

+---+-------------------------------------+-----------------------------------+
|idx|txt                                  |txt_parsed                         |
+---+-------------------------------------+-----------------------------------+
|1  |[{"AppId":"APACON","ExtId":"141730"}]|{"AppId":"APACON","ExtId":"141730"}|
|2  |[{"AppId":"APACON","ExtId":"141793"}]|{"AppId":"APACON","ExtId":"141793"}|
+---+-------------------------------------+-----------------------------------+

Then you can use pyspark.sql.functions.get_json_object to parse the txt_parsed column:

df = df.withColumn('AppId', F.get_json_object(df.txt_parsed, '$.AppId'))
df = df.withColumn('ExtId', F.get_json_object(df.txt_parsed, '$.ExtId'))
df.select('idx', 'txt_parsed', 'AppId', 'ExtId').show()

+---+--------------------+------+------+
|idx|          txt_parsed| AppId| ExtId|
+---+--------------------+------+------+
|  1|{"AppId":"APACON"...|APACON|141730|
|  2|{"AppId":"APACON"...|APACON|141793|
+---+--------------------+------+------+
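
Alternatively, since the question also mentions UDFs: a small UDF around Python's json module works on any Spark 2.x and yields the array of structs directly. A sketch, again reusing the imports above (json.loads returns a list of dicts, which PySpark maps onto an ArrayType of StructType):

import json

element = StructType([StructField("AppId", StringType()),
                      StructField("ExtId", StringType())])

# UDF: parse the JSON string into a list of dicts
parse_json_array = F.udf(lambda s: json.loads(s) if s is not None else None,
                         ArrayType(element))

df = df.withColumn('array', parse_json_array(df.txt))
df.select('idx', df.array[0]['ExtId']).show()

A UDF is generally slower than the native functions above, so prefer from_json where your Spark version allows it.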