
I'm quite new to PySpark, and I have a DataFrame that currently looks like the one below.

| col1                          | col2           |
+-------------------------------+----------------+
| [(a, 0), (b, 0), ..., (z, 1)] | [0, 0, ..., 1] |
| [(b, 0), (b, 1), ..., (z, 0)] | [0, 1, ..., 0] |
| [(a, 0), (c, 1), ..., (z, 0)] | [0, 1, ..., 0] |

I extracted the values from col1.QueryNum into col2; when I print the schema, col2 is an array containing the numbers from col1.QueryNum.

Ultimately, my goal is to convert the list values in col2 into a struct format in PySpark (refer to the desired schema below).

Current Schema

 |-- col1: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- types: string (nullable = true)
 |    |    |-- QueryNum: integer (nullable = true)
 |-- col2: array (nullable = true)
 |    |-- element: integer (containsNull = true)

Desired Schema

 |-- col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- val1: integer (nullable = true)
 |    |    |-- val2: integer (nullable = true)
                 .
                 .
                 .
 |    |    |-- val80: integer (nullable = true)

I tried using from_json, but it's not really working.
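
For reference, here is a minimal sketch that reproduces the current schema (toy data with two elements per row instead of the real 80):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# toy data: two (types, QueryNum) structs per row instead of the real 80
df = spark.createDataFrame(
    [
        ([("a", 0), ("b", 0)],),
        ([("b", 0), ("b", 1)],),
        ([("a", 0), ("c", 1)],),
    ],
    "col1: array<struct<types: string, QueryNum: int>>",
)
# col2 holds just the QueryNum values, as described above
df = df.withColumn("col2", F.col("col1.QueryNum"))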

  • And what are val1 and val2? How many elements do you have in col1? It's not clear from your example. Commented Jan 23, 2022 at 14:41
  • I have updated my question; hope it's clearer now. Commented Jan 23, 2022 at 14:49
  • Sorry, I can't understand why you want an array of structs instead of a simple array of values in col2. If you had 3 elements in col1, would you then add val3 to the struct in col2? Commented Jan 23, 2022 at 14:53
  • The reason is that I need to assign a name to each of val1, val2, ..., val80; this is for convenience in the following data-processing step. The number of elements is fixed: there are 80 in total. Commented Jan 23, 2022 at 14:55

1 Answer

If you have a fixed array size, you can create the struct using a list comprehension:

from pyspark.sql import functions as F

df1 = df.withColumn(
    "col2",
    F.array(
        F.struct(*[
            # one named field per array position; range(2) matches the
            # toy example, use range(80) for the real 80-element data
            F.col("col1")[i]["QueryNum"].alias(f"val{i+1}") for i in range(2)
        ])
    )
)

df1.show()
#+----------------+--------+
#|col1            |col2    |
#+----------------+--------+
#|[[0, a], [0, b]]|[[0, 0]]|
#|[[0, b], [1, b]]|[[0, 1]]|
#|[[0, a], [1, c]]|[[0, 1]]|
#+----------------+--------+

df1.printSchema()
#root
#|-- col1: array (nullable = true)
#|    |-- element: struct (containsNull = true)
#|    |    |-- QueryNum: long (nullable = true)
#|    |    |-- types: string (nullable = true)
#|-- col2: array (nullable = false)
#|    |-- element: struct (containsNull = false)
#|    |    |-- val1: long (nullable = true)
#|    |    |-- val2: long (nullable = true)

Note, however, that there is no need to use an array in this case, as you'll always have exactly one struct in that array. Just use a simple struct:

df1 = df.withColumn(
    "col2",
    F.struct(*[
        # same comprehension as above, just without the wrapping array
        F.col("col1")[i]["QueryNum"].alias(f"val{i+1}") for i in range(2)
    ])
)
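
With the int-typed toy data from the question, this should give a schema along these lines (a sketch; col2 becomes a plain struct instead of a single-element array):

df1.printSchema()
#root
#|-- col1: array (nullable = true)
#|    |-- element: struct (containsNull = true)
#|    |    |-- types: string (nullable = true)
#|    |    |-- QueryNum: integer (nullable = true)
#|-- col2: struct (nullable = false)
#|    |-- val1: integer (nullable = true)
#|    |-- val2: integer (nullable = true)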

Or if you prefer a map type:

df1 = df.withColumn(
    "col2",
    # transform pairs each element x with its index i and builds
    # (name, value) structs; map_from_entries turns those into a map
    F.map_from_entries(
        F.expr("transform(col1, (x,i) -> struct('val' || (i+1) as name, x.QueryNum as value))")
    )
)
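
One convenience of the map form (a sketch, using the toy data above): entries can be looked up by key, with null returned for a missing key:

# look up a single entry by key
df1.select(F.col("col2").getItem("val1").alias("val1")).show()
#+----+
#|val1|
#+----+
#|   0|
#|   0|
#|   0|
#+----+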

1 Comment

Thanks a lot! I managed to do something similar to the simple struct method; it's just the one I'm looking for!
