
I have a DataFrame that looks like this:

+---+-----+--------------------------------------------------------------------------------------------------+------+
|uid|label|features                                                                                          |weight|
+---+-----+--------------------------------------------------------------------------------------------------+------+
|1  |1.0  |[WrappedArray([animal_indexed,2.0,animal_indexed]), WrappedArray([talk_indexed,3.0,talk_indexed])]|1     |
|2  |0.0  |[WrappedArray([animal_indexed,1.0,animal_indexed]), WrappedArray([talk_indexed,2.0,talk_indexed])]|1     |
|3  |1.0  |[WrappedArray([animal_indexed,0.0,animal_indexed]), WrappedArray([talk_indexed,1.0,talk_indexed])]|1     |
|4  |2.0  |[WrappedArray([animal_indexed,0.0,animal_indexed]), WrappedArray([talk_indexed,0.0,talk_indexed])]|1     |
+---+-----+--------------------------------------------------------------------------------------------------+------+

and the schema is

root
 |-- uid: integer (nullable = false)
 |-- label: double (nullable = false)
 |-- features: array (nullable = false)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- name: string (nullable = true)
 |    |    |    |-- value: double (nullable = false)
 |    |    |    |-- term: string (nullable = true)
 |-- weight: integer (nullable = false)

But I want to convert the features column from Array[Array] to just Array, i.e. flatten the nested array within the same column, to get a schema like

root
 |-- uid: integer (nullable = false)
 |-- label: double (nullable = false)
 |-- features: array (nullable = false)
 |    |-- element: struct (containsNull = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- value: double (nullable = false)
 |    |    |-- term: string (nullable = true)
 |-- weight: integer (nullable = false)

Thanks in advance.

  • Try the explode function. Commented Sep 8, 2018 at 17:05
  • I don't want to get multiple rows from the same row. Rather, I want to remove the extra nesting in the DataFrame, i.e. go from Array[Array[]] to Array[] using something similar to the flatten operation on arrays. Commented Sep 8, 2018 at 17:15
  • Could you try the flatten function on your array? All Scala collections have this function (see the sketch after these comments). Commented Sep 8, 2018 at 17:33
  • @PJFanning can you give an example for a sample dataframe column and converting it from Array[Array] to Array type? Commented Sep 8, 2018 at 18:01
  • databricks-prod-cloudfront.cloud.databricks.com/public/… Commented Sep 8, 2018 at 22:03
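
For reference, a minimal sketch of what these comments point at, assuming Spark 2.4 or later where org.apache.spark.sql.functions.flatten is available (df stands for the DataFrame shown above):

import org.apache.spark.sql.functions.{col, flatten}

// Replace the Array[Array[struct]] column with its flattened Array[struct] form
val flattenedDF = df.withColumn("features", flatten(col("features")))
flattenedDF.printSchema()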

1 Answer


You should read your data as a Dataset with a schema:

case class Something(name: String, value: Double, term: String)
case class MyClass(uid: Int, label: Double, features: Seq[Seq[Something]], weight: Int)

then use a UDF like this:

import org.apache.spark.sql.functions.udf

// Scala's flatten concatenates the inner arrays; the declared return type Seq[Something] gives Spark the output schema
val flatUDF = udf((list: Seq[Seq[Something]]) => list.flatten)

val flattenedDF = myDataFrame.withColumn("features", flatUDF($"features"))  // overwrite "features" to match the desired schema

Example of reading the Dataset:

import spark.implicits._  // encoders for the case classes and the $"..." column syntax

val myDataFrame = spark.read.json(path).as[MyClass]
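
Alternatively, since the data is already typed, a map over the Dataset does the same thing. A minimal sketch, assuming a hypothetical MyFlatClass case class for the flattened shape:

// Hypothetical output type holding the flattened features column
case class MyFlatClass(uid: Int, label: Double, features: Seq[Something], weight: Int)

// Scala's collection flatten removes one level of nesting per row
val flattenedDS = myDataFrame.map(m => MyFlatClass(m.uid, m.label, m.features.flatten, m.weight))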

Hope this helps.
