Converting a dataframe to an array of struct of column names and values

Question

Suppose I have a DataFrame like this:

val customer = Seq(
    ("C1", "Jackie Chan", 50, "Dayton", "M"),
    ("C2", "Harry Smith", 30, "Beavercreek", "M"),
    ("C3", "Ellen Smith", 28, "Beavercreek", "F"),
    ("C4", "John Chan", 26, "Dayton","M")
  ).toDF("cid","name","age","city","sex")

How can I keep the cid values in one column and collect the remaining values into an array<struct<column_name, column_value>> in Spark?

Oli · Accepted Answer · 2019-05-10 14:24:44Z

5

The only difficulty is that arrays must contain elements of the same type. Therefore, you need to cast all the columns to strings before putting them in an array (age is an int in your case). Here is how it goes:

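import org.apache.spark.sql.functions.{array, col, lit, struct}
import spark.implicits._  // for the 'cid symbol syntax; assumes a SparkSession named spark

val cols = customer.columns.tail  // every column except the first one (cid)
val result = customer.select('cid,
    array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")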

result.show(false)

+---+-----------------------------------------------------------+
|cid|array                                                      |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]]     |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]]       |
+---+-----------------------------------------------------------+

result.printSchema()

root
 |-- cid: string (nullable = true)
 |-- array: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |-- value: string (nullable = true)
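For readers without a Spark session at hand, the transformation above can be modelled in plain Python. This is only a sketch of the resulting shape, not Spark code: the helper name `to_name_value_array` and the tuple-based rows are assumptions made for illustration. It mirrors the two key steps, skipping the first column (cid) and casting every remaining value to a string so that all structs share one type.

```python
# Plain-Python model of the select above; no Spark required.
columns = ["cid", "name", "age", "city", "sex"]
rows = [
    ("C1", "Jackie Chan", 50, "Dayton", "M"),
    ("C2", "Harry Smith", 30, "Beavercreek", "M"),
    ("C3", "Ellen Smith", 28, "Beavercreek", "F"),
    ("C4", "John Chan", 26, "Dayton", "M"),
]


def to_name_value_array(row, columns):
    """Return (cid, [{'name': column, 'value': str(value)}, ...]) for one row.

    Hypothetical helper: assumes cid is the first column, like columns.tail does.
    """
    cid, rest = row[0], row[1:]
    # str() plays the role of cast "string": every value ends up with one type.
    structs = [{"name": c, "value": str(v)} for c, v in zip(columns[1:], rest)]
    return cid, structs


result = [to_name_value_array(r, columns) for r in rows]
for cid, structs in result:
    print(cid, structs)
```

Each printed row pairs a cid with a list of name/value records, which is the same nesting the Spark schema describes.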

edited May 10, 2019 at 14:24

answered May 10, 2019 at 14:04

Oli


1 Comment

Srinivas Over a year ago

Thanks @Oli, that's exactly what I needed.

Michel Lemay · Accepted Answer · 2019-05-10 14:03:41Z

2

You can do it using array and struct functions:

customer.select($"cid", array(struct(lit("name") as "column_name", $"name" as "column_value"), struct(lit("age") as "column_name", $"age" as "column_value") ))

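Provided the column_value fields share one type (cast age to string if it is an int, as in the question), this produces the following schema: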

 |-- cid: string (nullable = true)
 |-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- column_name: string (nullable = false)
 |    |    |-- column_value: string (nullable = true)

answered May 10, 2019 at 14:03

Michel Lemay


Topde · Accepted Answer · 2020-03-10 13:40:39Z

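Map columns might be a better way to deal with the overall problem, because each value can then be looked up by its column name. Keep in mind that a Spark map has a single value type: depending on the Spark version, mixed string and integer values are either coerced to a common type or rejected with a type mismatch error, so casting explicitly is the safer choice.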

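from pyspark.sql.functions import col, create_map, lit

# A map has a single value type, so age is cast to match the string columns.
df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age").cast("string"),
               lit("city"), col("city"), lit("sex"), col("sex")
               ).alias('map_col')
  )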

Or wrap the map column in an array if you need an array type.

This way you can still apply numerical or string transformations to the value stored under a given key, casting it back to a number where needed. For example:

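from pyspark.sql.functions import col, create_map, lit, map_concat, when

# Keep the intermediate result so map_col can be referenced in the next select.
with_map = df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age").cast("string"),
               lit("city"), col("city"), lit("sex"), col("sex")
               ).alias('map_col')
  )
# Append a 'u_age' entry (null unless age < 18); the flag is a string to match map_col's value type.
with_map.select('*',
      map_concat(col('map_col'),
                 create_map(lit('u_age'), when(col('map_col')['age'].cast('int') < 18, lit('true')))
                 ).alias('map_with_flag')
)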

Hope that makes sense. I typed this straight in here, so forgive me if there's a bracket missing somewhere.
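The map-based idea can likewise be modelled in plain Python, with a dict standing in for the map column. This is a sketch rather than Spark code: `to_map` and `add_minor_flag` are hypothetical names, and the second row is made up purely so that the under-18 flag is set at least once. Values are kept as strings to mirror a map with a single value type, so the age has to be converted back to an integer before comparing.

```python
# Plain-Python model of the map approach; a dict stands in for the map column.
columns = ["cid", "name", "age", "city", "sex"]
rows = [
    ("C1", "Jackie Chan", 50, "Dayton", "M"),
    ("C5", "Sam Young", 12, "Dayton", "M"),  # made-up row, not from the question
]


def to_map(row, columns):
    """Return (cid, {column: str(value)}); assumes cid is the first column."""
    return row[0], {c: str(v) for c, v in zip(columns[1:], row[1:])}


def add_minor_flag(mapping):
    """Return a copy of the map with a 'u_age' entry: 'true' under 18, else None."""
    flag = "true" if int(mapping["age"]) < 18 else None
    return {**mapping, "u_age": flag}


for row in rows:
    cid, mapping = to_map(row, columns)
    print(cid, add_minor_flag(mapping))
```

Copying the dict instead of mutating it loosely mirrors how the Spark version derives a new column rather than changing map_col in place.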
