
I have the following JSON objects:

{
    "user_id": "123",
    "data": {
        "city": "New York"
    },
    "timestamp": "1563188698.31",
    "session_id": "6a793439-6535-4162-b333-647a6761636b"
}
{
    "user_id": "123",
    "data": {
        "name": "some_name",
        "age": "23",
        "occupation": "teacher"
    },
    "timestamp": "1563188698.31",
    "session_id": "6a793439-6535-4162-b333-647a6761636b"
}

I'm using val df = sqlContext.read.json("json") to read the file into a DataFrame.

This merges the data attributes from all objects into a single data struct, like so:

root
 |-- data: struct (nullable = true)
 |    |-- age: string (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- occupation: string (nullable = true)
 |-- session_id: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- user_id: string (nullable = true)

Is it possible to transform the data field to a Map[String, String] type, so that each row keeps only the attributes present in its original JSON object?

  • Hi! Transforming a Spark DataFrame Row into a Scala Map is not a straightforward task. I can help you with it, but you need to give more details about your use case. What do you want to do with the Map objects? What kind of operations do you want to perform on the nested data? Commented Jul 15, 2019 at 19:47
  • Hi @ÁlvaroValencia, I'm looking to generate Parquet files from JSON. I'm using Athena on AWS and need to match the table format to make the data queryable. Thank you Commented Jul 15, 2019 at 20:14

2 Answers


Yes, you can achieve that by extracting a Map[String, String] from the JSON data, as shown next:

import org.apache.spark.sql.types.{MapType, StringType}
import org.apache.spark.sql.functions.{to_json, from_json}
import spark.implicits._ // for .toDS and the $ column syntax (already in scope in spark-shell)

val jsonStr = """{
    "user_id": "123",
    "data": {
        "name": "some_name",
        "age": "23",
        "occupation": "teacher"
    },
    "timestamp": "1563188698.31",
    "session_id": "6a793439-6535-4162-b333-647a6761636b"
}"""

val df = spark.read.json(Seq(jsonStr).toDS)

val mappingSchema = MapType(StringType, StringType)

df.select(from_json(to_json($"data"), mappingSchema).as("map_data")).show(false)

//Output
// +-----------------------------------------------------+
// |map_data                                             |
// +-----------------------------------------------------+
// |[age -> 23, name -> some_name, occupation -> teacher]|
// +-----------------------------------------------------+

First we serialize the data struct into a JSON string with to_json($"data"), then we parse that string and extract the Map with from_json(to_json($"data"), mappingSchema).
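
A follow-up, since the comments on the question mention that the end goal is Parquet files queryable from Athena: as a minimal sketch (the output path and write mode below are my assumptions, not part of this answer), you can swap the struct column for the map column in place and write the result:

import org.apache.spark.sql.types.{MapType, StringType}
import org.apache.spark.sql.functions.{to_json, from_json}

// Replace the struct column with a Map[String, String] column in place,
// keeping all other columns, then write Parquet.
// "output/users" is a hypothetical path.
val mapped = df.withColumn("data",
  from_json(to_json($"data"), MapType(StringType, StringType)))

mapped.write.mode("overwrite").parquet("output/users")

In Athena, the data column can then be declared as map<string,string> in the table DDL.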


2 Comments

This works! I just need to append the column to df and work with one JSON at a time. Thank you
Yes exactly @stepandel

I'm not sure what you mean by converting it to a Map of (String, String), but see if the below can help.

val dataDF = spark.read.option("multiline", "true").json("madhu/user.json").select("data")

dataDF
  .withColumn("age", $"data"("age"))
  .withColumn("city", $"data"("city"))
  .withColumn("name", $"data"("name"))
  .withColumn("occupation", $"data"("occupation"))
  .drop("data")
  .show
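
If you'd rather not hardcode the nested field names, here is a sketch (my addition, not part of the original answer) that reads them from the struct's schema and flattens the struct generically:

import org.apache.spark.sql.types.StructType
import spark.implicits._

// Read the field names from the nested "data" struct's schema,
// then promote each one to a top-level column.
val fields = dataDF.schema("data").dataType.asInstanceOf[StructType].fieldNames

val flatDF = fields
  .foldLeft(dataDF) { (acc, f) => acc.withColumn(f, $"data"(f)) }
  .drop("data")

flatDF.show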

