
I am stuck trying to extract columns from a list of lists and can't visualize how to do it. I am fairly new to Spark, running PySpark on Spark 2.4.3.

I have a json organized like this:

{ "meta" : { ... },
  "data" : 
  [[ "a", 0, null, "{ }"],
   [ "b", 0, null, "{ }"],
   [ "c", 0, null, "{ }"],
   ] }

I want to get the 'data' portion into columns, like

 +------+------+------+------+
 | col1 | col2 | col3 | col4 |
 +------+------+------+------+
 |   a  |   0  | None | "{ }"|
 |   b  |   0  | None | "{ }"|
 |   c  |   0  | None | "{ }"|
 +------+------+------+------+

I have my dataframe read in, and printSchema() shows this:

root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- meta: struct (nullable = true)
 |    |-- view: struct (nullable = true)
 |    |    |-- approvals: array (nullable = true) ...

The rough shape of the full data is 70 columns by 650k rows.

I was able to explode the df to get just the data portion, but I am stuck there.
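For reference, a minimal sketch of the read step (assuming the JSON lives in a local file named data.json, a hypothetical path; multiLine is needed because each record spans multiple lines):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical file name; multiLine=True is required because the
# JSON object is spread over multiple lines rather than one per line.
df = spark.read.json('data.json', multiLine=True)
df.printSchema()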


2 Answers


Explode the rows first, and then select the array elements using [] in Python.

from pyspark.sql import functions as F

# Explode the outer array so each inner array becomes a row,
# then pull out each element by index as its own column.
df2 = df.select(F.explode('data').alias('data')) \
        .select(*[F.col('data')[i].alias('col%s' % (i + 1)) for i in range(4)])

df2.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|   a|   0|null| { }|
|   b|   0|null| { }|
|   c|   0|null| { }|
+----+----+----+----+
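Since the full data has about 70 columns, you probably don't want to hardcode range(4). A minimal sketch of deriving the column count from the first row instead (assuming every inner array has the same length):

from pyspark.sql import functions as F

# Take the length of the first inner array as the column count;
# this assumes every row of 'data' has the same number of elements.
n_cols = len(df.select('data').first()['data'][0])

df2 = df.select(F.explode('data').alias('data')) \
        .select(*[F.col('data')[i].alias('col%s' % (i + 1)) for i in range(n_cols)])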


Why not just use the SparkSession.createDataFrame() method?

https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#pyspark.sql.SparkSession.createDataFrame

You can provide the data and schema parameters to this method and get a Spark dataframe.

Example:

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.getOrCreate()
df = sparkSession.createDataFrame(data)
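For example, a minimal end-to-end sketch (assuming the JSON is in a local file named data.json, a hypothetical path):

import json

from pyspark.sql import SparkSession

sparkSession = SparkSession.builder.getOrCreate()

# Hypothetical file name: parse the JSON with plain Python first,
# then hand only the 'data' portion to Spark.
with open('data.json') as f:
    payload = json.load(f)

# Note: schema inference can fail here if a column is entirely
# None (as col3 is in the sample), in which case pass an explicit
# schema as shown below.
df = sparkSession.createDataFrame(payload['data'])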

If Spark cannot infer the schema from the data, then the schema also needs to be provided:

from pyspark.sql.types import StructType

# Each add() call takes: column name, type name, nullable
struct = StructType()
struct.add("col1", "string", True)
struct.add("col2", "integer", True)
struct.add("col3", "string", True)
struct.add("col4", "string", True)


df = sparkSession.createDataFrame(data=data, schema=struct)
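With ~70 columns, writing out every add() call by hand is tedious; here is a sketch of building the same kind of schema in a loop (assuming every column can be treated as a nullable string, with hypothetical colN names):

from pyspark.sql.types import StructType

# Hypothetical: generate colN names and treat everything as a
# nullable string; adjust individual types where needed.
struct = StructType()
for i in range(70):
    struct.add('col%s' % (i + 1), 'string', True)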

In addition, you can use PySpark type classes instead of Python primitive type names: https://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html#module-pyspark.sql.types

The module contains both simple types (StringType, IntegerType, ...) and complex types (ArrayType, MapType, ...).

One last note: the data cannot contain null; it should be None in Python. Spark's DataFrame.show() will print None values as null.
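For example, the rows from the question would look like this as plain Python data:

# None, not null, in the Python list of lists
data = [["a", 0, None, "{ }"],
        ["b", 0, None, "{ }"],
        ["c", 0, None, "{ }"]]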

1 Comment

I didn't want to specify a schema, as I have 70 columns.
