0

Suppose I have a dataframe like this

val customer = Seq(
    ("C1", "Jackie Chan", 50, "Dayton", "M"),
    ("C2", "Harry Smith", 30, "Beavercreek", "M"),
    ("C3", "Ellen Smith", 28, "Beavercreek", "F"),
    ("C4", "John Chan", 26, "Dayton","M")
  ).toDF("cid","name","age","city","sex")

How can i get cid values in one column and get the rest of the values in an array < struct < column_name, column_value > > in spark

3 Answers 3

5

The only difficulty is that arrays must contain elements of the same type. Therefore, you need to cast all the columns to strings before putting them in an array (age is an int in your case). Here is how it goes:

val cols = customer.columns.tail
val result = customer.select('cid,
    array(cols.map(c => struct(lit(c) as "name", col(c) cast "string" as "value")) : _*) as "array")

result.show(false)

+---+-----------------------------------------------------------+
|cid|array                                                      |
+---+-----------------------------------------------------------+
|C1 |[[name,Jackie Chan], [age,50], [city,Dayton], [sex,M]]     |
|C2 |[[name,Harry Smith], [age,30], [city,Beavercreek], [sex,M]]|
|C3 |[[name,Ellen Smith], [age,28], [city,Beavercreek], [sex,F]]|
|C4 |[[name,John Chan], [age,26], [city,Dayton], [sex,M]]       |
+---+-----------------------------------------------------------+

result.printSchema()

root
 |-- cid: string (nullable = true)
 |-- array: array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- name: string (nullable = false)
 |    |    |-- value: string (nullable = true)
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks @Oli, that's exactly what I needed.
2

You can do it using array and struct functions:

customer.select($"cid", array(struct(lit("name") as "column_name", $"name" as "column_value"), struct(lit("age") as "column_name", $"age" as "column_value") ))

will make:

 |-- cid: string (nullable = true)
 |-- array(named_struct(column_name, name AS `column_name`, NamePlaceholder(), name AS `column_value`), named_struct(column_name, age AS `column_name`, NamePlaceholder(), age AS `column_value`)): array (nullable = false)
 |    |-- element: struct (containsNull = false)
 |    |    |-- column_name: string (nullable = false)
 |    |    |-- column_value: string (nullable = true)

Comments

1

Map columns might be a better way to deal with the overall problem. You can keep different value types in the same map, without having to cast it to string.

df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"),col("sex")
               ).alias('map_col')
  )

or wrap the map col in an array if you want it

This way you can still do numerical or string transformations on the relevant key or value. For example:

df.select('cid',
    create_map(lit("name"), col("name"), lit("age"), col("age"),
               lit("city"), col("city"), lit("sex"),col("sex")
               ).alias('map_col')
  )
df.select('*', 
      map_concat( col('cid'), create_map(lit('u_age'),when(col('map_col')['age'] < 18, True)))
)

Hope that makes sense, typed this straight in here so forgive if there's a bracket missing somewhere

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.