I am stuck trying to extract columns from a list of lists and can't visualize how to do it. I am fairly new to Spark and am running PySpark on Spark 2.4.3.
I have a JSON file organized like this:
{ "meta" : { ... },
"data" :
[[ "a", 0, null, "{ }"],
[ "b", 0, null, "{ }"],
[ "c", 0, null, "{ }"]
] }
I want to get the 'data' portion into columns, like this:
+------+------+------+------+
| col1 | col2 | col3 | col4 |
+------+------+------+------+
| a    | 0    | None | "{ }"|
| b    | 0    | None | "{ }"|
| c    | 0    | None | "{ }"|
+------+------+------+------+
I have read the file into a DataFrame, and printSchema() shows this:
root
 |-- data: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- meta: struct (nullable = true)
 |    |-- view: struct (nullable = true)
 |    |    |-- approvals: array (nullable = true) ...
The data is roughly 70 columns by 650k rows.
I was able to explode the DataFrame to get just the data portion, but I am stuck there.
You can use the getItem function, as shown here: stackoverflow.com/questions/47874037/…