
I am stuck trying to extract columns from a list of lists and can't visualize how to do it. I am fairly new to Spark, running PySpark on Spark 2.4.3.

I have a json organized like this:

{ "meta" : { ... },
  "data" : 
  [[ "a", 0, null, "{ }"],
   [ "b", 0, null, "{ }"],
   [ "c", 0, null, "{ }"],
   ] }

I want to get the 'data' portion into columns, like

 +------+------+------+------+
 | col1 | col2 | col3 | col4 |
 +------+------+------+------+
 |   a  |   0  | None | "{ }"|
 |   b  |   0  | None | "{ }"|
 |   c  |   0  | None | "{ }"|
 +------+------+------+------+

I have my dataframe read in, and printSchema() shows this:

root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- meta: struct (nullable = true)
 |    |-- view: struct (nullable = true)
 |    |    |-- approvals: array (nullable = true) ...

The rough shape of the full data is 70 columns by 650k rows.

I was able to explode the df to get just the data portion, but I am stuck there.
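For reference, a minimal sketch of the read step (assuming the JSON lives in a local file named data.json, a hypothetical path; multiLine is needed because each record spans multiple lines):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical file name; multiLine=True is required because the
# JSON object is spread over multiple lines rather than one per line.
df = spark.read.json('data.json', multiLine=True)
df.printSchema()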


2 Answers


Explode the rows first, and then select the array elements using [] in Python.

from pyspark.sql import functions as F

# Explode the outer array so each inner array becomes a row,
# then pull out each element by index as its own column.
df2 = df.select(F.explode('data').alias('data')) \
        .select(*[F.col('data')[i].alias('col%s' % (i + 1)) for i in range(4)])

df2.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   a|   0|null| { }|
|   b|   0|null| { }|
|   c|   0|null| { }|
+----+----+----+----+
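Since the full data has about 70 columns, you probably don't want to hardcode range(4). A minimal sketch of deriving the column count from the first row instead (assuming every inner array has the same length):

from pyspark.sql import functions as F

# Take the length of the first inner array as the column count;
# this assumes every row of 'data' has the same number of elements.
n_cols = len(df.select('data').first()['data'][0])

df2 = df.select(F.explode('data').alias('data')) \
        .select(*[F.col('data')[i].alias('col%s' % (i + 1)) for i in range(n_cols)])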


Why not just use the SparkSession.createDataFrame() method?

https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame

You can provide the data and schema parameters to this method and get a Spark dataframe.

Example:

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.getOrCreate()
df = sparkSession.createDataFrame(data)
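For example, a minimal end-to-end sketch (assuming the JSON is in a local file named data.json, a hypothetical path):

import json

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.getOrCreate()

# Hypothetical file name: parse the JSON with plain Python first,
# then hand only the 'data' portion to Spark.
with open('data.json') as f:
    payload = json.load(f)

# Note: schema inference can fail here if a column is entirely
# None (as col3 is in the sample), in which case pass an explicit
# schema as shown below.
df = sparkSession.createDataFrame(payload['data'])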

If Spark cannot infer the schema from the data, then the schema also needs to be provided:

from pyspark.sql.types import StructType

# Each add() call takes: column name, type name, nullable
struct = StructType()
struct.add("col1", "string", True)
struct.add("col2", "integer", True)
struct.add("col3", "string", True)
struct.add("col4", "string", True)


df = sparkSession.createDataFrame(data=data, schema=struct)
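With ~70 columns, writing out every add() call by hand is tedious; here is a sketch of building the same kind of schema in a loop (assuming every column can be treated as a nullable string, with hypothetical colN names):

from pyspark.sql.types import StructType

# Hypothetical: generate colN names and treat everything as a
# nullable string; adjust individual types where needed.
struct = StructType()
for i in range(70):
    struct.add('col%s' % (i + 1), 'string', True)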

In addition, you can use PySpark type classes instead of Python primitive type names: https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#module-pyspark.sql.types

The module contains both simple types (StringType, IntegerType, ...) and complex types (ArrayType, MapType, ...).

One last note: the data cannot contain null; it should be None in Python. Spark's DataFrame.show() will print None values as null.
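For example, the rows from the question would look like this as plain Python data:

# None, not null, in the Python list of lists
data = [["a", 0, None, "{ }"],
        ["b", 0, None, "{ }"],
        ["c", 0, None, "{ }"]]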

1 Comment

I didn't want to specify a schema, as I have 70 columns.
