
I have a DataFrame that looks like this:

+---+-----+--------------------------------------------------------------------------------------------------+------+
|uid|label|features                                                                                          |weight|
+---+-----+--------------------------------------------------------------------------------------------------+------+
|1  |1.0  |[WrappedArray([animal_indexed,2.0,animal_indexed]), WrappedArray([talk_indexed,3.0,talk_indexed])]|1     |
|2  |0.0  |[WrappedArray([animal_indexed,1.0,animal_indexed]), WrappedArray([talk_indexed,2.0,talk_indexed])]|1     |
|3  |1.0  |[WrappedArray([animal_indexed,0.0,animal_indexed]), WrappedArray([talk_indexed,1.0,talk_indexed])]|1     |
|4  |2.0  |[WrappedArray([animal_indexed,0.0,animal_indexed]), WrappedArray([talk_indexed,0.0,talk_indexed])]|1     |
+---+-----+--------------------------------------------------------------------------------------------------+------+

and the schema is

root
 |-- uid: integer (nullable = false)
 |-- label: double (nullable = false)
 |-- features: array (nullable = false)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- value: double (nullable = false)
 |    |    |    |-- term: string (nullable = true)
 |-- weight: integer (nullable = false)

But I want to convert the features column from Array[Array] to just Array, i.e. flatten the nested array within the same column, to get a schema like

root
 |-- uid: integer (nullable = false)
 |-- label: double (nullable = false)
 |-- features: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: double (nullable = false)
 |    |    |-- term: string (nullable = true)
 |-- weight: integer (nullable = false)

Thanks in advance.

  • Try the explode function. Commented Sep 8, 2018 at 17:05
  • I don't want to get multiple rows from the same row. Rather, I want to remove the extra nesting in the DataFrame, i.e. go from Array[Array[]] to Array[] using something similar to the flatten operation on arrays. Commented Sep 8, 2018 at 17:15
  • Could you try the flatten function on your array? All Scala collections have this function (see the sketch after these comments). Commented Sep 8, 2018 at 17:33
  • @PJFanning can you give an example for a sample dataframe column and converting it from Array[Array] to Array type? Commented Sep 8, 2018 at 18:01
  • databricks-prod-cloudfront.cloud.databricks.com/public/… Commented Sep 8, 2018 at 22:03
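
For reference, a minimal sketch of what these comments point at, assuming Spark 2.4 or later where org.apache.spark.sql.functions.flatten is available (df stands for the DataFrame shown above):

import org.apache.spark.sql.functions.{col, flatten}

// Replace the Array[Array[struct]] column with its flattened Array[struct] form
val flattenedDF = df.withColumn("features", flatten(col("features")))
flattenedDF.printSchema()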

1 Answer


You should read your data as a Dataset with a schema:

case class Something(name: String, value: Double, term: String)
case class MyClass(uid: Int, label: Double, features: Seq[Seq[Something]], weight: Int)

then use a UDF like this:

import org.apache.spark.sql.functions.udf

// Scala's flatten concatenates the inner arrays; the declared return type Seq[Something] gives Spark the output schema
val flatUDF = udf((list: Seq[Seq[Something]]) => list.flatten)

val flattenedDF = myDataFrame.withColumn("features", flatUDF($"features"))  // overwrite "features" to match the desired schema

Example of reading the Dataset:

import spark.implicits._  // encoders for the case classes and the $"..." column syntax

val myDataFrame = spark.read.json(path).as[MyClass]
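
Alternatively, since the data is already typed, a map over the Dataset does the same thing. A minimal sketch, assuming a hypothetical MyFlatClass case class for the flattened shape:

// Hypothetical output type holding the flattened features column
case class MyFlatClass(uid: Int, label: Double, features: Seq[Something], weight: Int)

// Scala's collection flatten removes one level of nesting per row
val flattenedDS = myDataFrame.map(m => MyFlatClass(m.uid, m.label, m.features.flatten, m.weight))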

Hope this helps.
