I'm quite new to PySpark and I have a dataframe that currently looks like this:
+---------------------------------+-------------------+
| col1                            | col2              |
+---------------------------------+-------------------+
| [(a, 0)], [(b,0)].....[(z,1)]   | [0, 0, ... 1]     |
| [(b, 0)], [(b,1)].....[(z,0)]   | [0, 1, ... 0]     |
| [(a, 0)], [(c, 1)].....[(z,0)]  | [0, 1, ... 0]     |
+---------------------------------+-------------------+
I extracted the values from col1.QueryNum into col2; when I print the schema, col2 is an array containing the numbers from col1.QueryNum. The extraction was roughly like the sketch below.
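A minimal sketch of that step, assuming df is the dataframe shown above (selecting a struct field across an array of structs returns an array of that field's values):

```python
from pyspark.sql import functions as F

# Pulling QueryNum out of each struct in col1 yields array<int>,
# so col2 ends up as an array of integers.
df = df.withColumn("col2", F.col("col1.QueryNum"))
```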
Ultimately, my goal is to convert the list values in col2 into a struct format inside PySpark (refer to the desired schema below).
Current Schema
|-- col1: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- types: string (nullable = true)
| | |-- QueryNum: integer (nullable = true)
|-- col2: array (nullable = true)
| |-- element: integer (containsNull = true)
Desired Schema
|-- col2: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- val1: integer (nullable = true)
| | |-- val2: integer (nullable = true)
.
.
.
| | |-- val80: integer (nullable = true)
I tried using from_json, but it isn't really working.
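For illustration, this is the kind of transformation I'm after, assuming col2 always holds exactly 80 values (df2 and the val1..val80 field names are just placeholders matching my desired schema), though I'm not sure this is the idiomatic way:

```python
from pyspark.sql import functions as F

# Sketch only: turn the 80-element integer array in col2 into an array
# holding one struct with fields val1..val80 (assumes every row has
# exactly 80 values in col2).
n = 80
df2 = df.withColumn(
    "col2",
    F.array(
        F.struct(*[F.col("col2")[i].alias(f"val{i + 1}") for i in range(n)])
    ),
)
df2.printSchema()
```

The comprehension generates one struct field per array position, which avoids writing out all 80 field names by hand.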
Comment: val1 and val2? How many elements do you have in col1? It's not clear from your example of col2. If you had 3 elements in col1, would you then add val3 to the struct in col2?