
I have this type of file where each line is a JSON object except for the first few words (see attached image). I want to parse this type of file using Spark and Scala. I have tried sqlContext.read.json("path to json file"), but it gives me an error (corrupt data) because each line as a whole is not a JSON object. How do I parse this file into a Spark SQL DataFrame?
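For reference, a minimal sketch of the call that was attempted (the path is a placeholder); it fails because every line starts with non-JSON text:

// Sketch of the attempted direct read; the path is a placeholder
val df = sqlContext.read.json("path/to/file")  // lines are rejected as corrupt because of the non-JSON prefix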

  • If you have invalid JSON, you can't parse it using any tool. Commented Mar 3, 2017 at 8:58
  • Is this invalid JSON? Commented Mar 3, 2017 at 9:00
  • Well, given that you have non-JSON data before the actual JSON, then yes, it's not valid in Spark's eyes. You need to extract that data separately. Commented Mar 3, 2017 at 9:03
  • Is there any way in Spark to extract that data separately? Commented Mar 3, 2017 at 9:07
  • @AkhilChoudhari do these "first few words" have the same length in all rows? Commented Mar 3, 2017 at 9:15

1 Answer


Try this:

// Read the file as plain text, one record per line
val rawRdd = sc.textFile("path-to-the-file")

// Drop the non-JSON prefix: 32 is the number of leading characters to ignore on each line
val jsonRdd = rawRdd.map(_.substring(32))

// Parse the remaining JSON strings into a DataFrame
val df = spark.read.json(jsonRdd)
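If the prefix does not have the same length on every line (the point raised in the comments above), a slight variation of the same idea is to drop everything before the first '{' instead of a fixed number of characters. This is only a sketch and assumes every line contains a single JSON object starting at its first '{':

// Strip everything before the first '{' on each line (sketch; assumes the JSON object starts there)
val jsonRdd = rawRdd.map { line =>
  val start = line.indexOf('{')
  if (start >= 0) line.substring(start) else line
}

val df = spark.read.json(jsonRdd)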

5 Comments

  • The last command gave me the error shown below: at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:287) at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229)
  • It would be easier if you could provide some example data to test with.
  • What version of Spark do you use?
  • When I give the whole 20 MB file to spark.read.json(), it doesn't work, but it works for half of the file. Why?
  • It gives me this error: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
