
There is an array field in my dataset, like this:

my_array:
[
{id: 1, value: x},
{id: 2, value: y}
]

How can I make it look like this:

my_struct: {
  1: {value: x},
  2: {value: y}
}

I have tried map_from_entries with transform, but I still get an array of structs as output.
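
For reference, a single map_from_entries call over the whole transformed array does yield one map instead of an array; applied inside transform it runs once per element, so the column stays an array. A minimal sketch, assuming a DataFrame df with the schema above (my_struct and mapped are illustrative names):

import org.apache.spark.sql.functions.expr

// One map_from_entries over the whole array: key = id, value = struct<value>.
val mapped = df.withColumn(
  "my_struct",
  expr("map_from_entries(transform(my_array, x -> struct(x.id, struct(x.value))))")
)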

UPDATED

There is a dataset which reads data from JSON. The data looks like this:

{"id":1, ... "arrayOfStructs" : [{"name": "x", "key":"value"}, {"name": "y", "key":"value2"}]}

The output should be something like:

{"id":1, ... "structsOnly" : { "x": {"name": "x", "key":"value"}}, { "y": {"name": "y", "key":"value2"}}}
  • Curious about the ID numbers as column names. Are they the same across all the rows? Spark DF needs a well-defined schema and stable column names. Commented Jan 19, 2022 at 12:04

2 Answers


I think you want to use MapType, not StructType, in this case, since a struct would require you to know all the possible key values up front. Something like this, using the transform + aggregate functions:

import org.apache.spark.sql.functions.expr

// Turn each struct into a single-entry map keyed by `name`, then fold the
// resulting array of maps into one map, starting from an empty, typed map.
val df1 = df.withColumn(
    "structsOnly",
    expr("""aggregate(
              transform(arrayOfStructs, x -> map(x.name, x)),
              cast(map() as map<string,struct<name:string,key:string>>),
              (acc, x) -> map_concat(acc, x)
           )
    """)
  ).drop("arrayOfStructs")

df1.printSchema
//root
// |-- id: integer (nullable = false)
// |-- structsOnly: map (nullable = true)
// |    |-- key: string
// |    |-- value: struct (valueContainsNull = true)
// |    |    |-- name: string (nullable = true)
// |    |    |-- key: string (nullable = true)

df1.toJSON.show(false)
//+---------------------------------------------------------------------------------------+
//|value                                                                                  |
//+---------------------------------------------------------------------------------------+
//|{"id":1,"structsOnly":{"x":{"name":"x","key":"value"},"y":{"name":"y","key":"value2"}}}|
//+---------------------------------------------------------------------------------------+

Now, if you really want a struct-type column, then you'll need to collect all the possible map keys (the values of field name) first, and then construct the column like this:

import org.apache.spark.sql.functions.{col, map_keys, struct}
import spark.implicits._ // for $"..." and .as[Seq[String]]

// Collect the distinct map keys across all rows (this triggers a Spark job),
// then build one struct field per key.
val keys = df1.select(map_keys($"structsOnly")).as[Seq[String]].collect.flatten.distinct

val df2 = df1.withColumn(
  "structsOnly",
  struct(keys.map(k => col("structsOnly").getField(k).as(k)): _*)
)
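
With the sample row above, keys resolves to Seq("x", "y") (order may vary), so the rebuilt column's schema should come out roughly like this:

df2.printSchema
//root
// |-- id: integer (nullable = false)
// |-- structsOnly: struct (nullable = false)
// |    |-- x: struct (nullable = true)
// |    |    |-- name: string (nullable = true)
// |    |    |-- key: string (nullable = true)
// |    |-- y: struct (nullable = true)
// |    |    |-- name: string (nullable = true)
// |    |    |-- key: string (nullable = true)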

This may seem like a simple task at first glance, but it is not quite so simple...

Using this as input:

import spark.implicits._ // for toDF on a local Seq

case class Strct(id: Int, value: String)
val df = Seq(Seq(Strct(1, "x"), Strct(2, "y"))).toDF("my_array")

print(df.toJSON.head())
// {"my_array":[{"id":1,"value":"x"},{"id":2,"value":"y"}]}

df.printSchema()
// root
//  |-- my_array: array (nullable = true)
//  |    |-- element: struct (containsNull = true)
//  |    |    |-- id: integer (nullable = false)
//  |    |    |-- value: string (nullable = true)

I would first build the map, serialize it to JSON so the schema can be inferred, and then parse it back as a struct. Note that Column.withField and Column.dropFields used below require Spark 3.1+.

import org.apache.spark.sql.functions._
import spark.implicits._

// Each element becomes struct<id, value:struct<value>>, i.e. a key/value entry;
// aggregate then folds the entries into one map<int,struct<value:string>> per
// row, which is serialized to JSON.
val json_col = to_json(aggregate(
    transform($"my_array", x => x.withField("value", x.dropFields("id"))),
    map().cast("map<int,struct<value:string>>"),
    (acc, x) => map_concat(acc, map_from_entries(array(x)))
))
// Infer the schema from the JSON (this runs a job), then parse it back as a struct.
val json_schema = spark.read.json(df.select(json_col).as[String]).schema
val df2 = df.select(from_json(json_col, json_schema).alias("my_struct"))

Result:

print(df2.toJSON.head())
// {"my_struct":{"1":{"value":"x"},"2":{"value":"y"}}}

df2.printSchema()
// root
//  |-- my_struct: struct (nullable = true)
//  |    |-- 1: struct (nullable = true)
//  |    |    |-- value: string (nullable = true)
//  |    |-- 2: struct (nullable = true)
//  |    |    |-- value: string (nullable = true)
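
The stringified ids become ordinary struct field names, so the nested values can be selected like any other column. A small usage sketch against the df2 above (v1 is an illustrative alias):

// Fields are named after the original ids ("1", "2") and addressed via getField.
df2.select($"my_struct".getField("1").getField("value").alias("v1")).show()
// prints a single row with v1 = "x"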

