I have a JSON file with multiple 'records' that I can load into a MongoDB database, and from there extract selected records into a pandas DataFrame. This works fine. However, I would like to skip MongoDB and load all the records in the JSON file directly into a pandas DataFrame. I thought this would be easy, but it is not working at all.

This is what I have tried:

import pandas as pd
!wget -O peopleData.json -q https://github.com/prithwis/parashar21/raw/main/data/peopleDataTest5.json
data = pd.read_json('/content/peopleData.json')
#data = pd.read_json('/content/peopleData.json', lines=True)

This is throwing errors. I am using Google Colab and the notebook is available at this link.

I have seen quite a few other questions on Stack Overflow that seem to address the same problem, but none of the answers work in my case. I would be grateful if someone could help me fix this.

  • You can convert your JSON file to a DataFrame by reading it directly from the URL with data = pd.read_json(URL, lines=True), but the JSON in your link/file does not seem to be valid. jsonlint reports "Error: Parse error on line 193". Commented Aug 27, 2022 at 11:51
  • Thank you for your comment, but if you see the notebook you would see that the file does exist and gets downloaded into Colab VM with wget and then (though not shown here) is being used for other operations correctly Commented Aug 27, 2022 at 12:13
  • I looked closely at the output of jsonlint and observed the following: I have five 'records' that are separated by blanks, that is { ... } { ... } { ... } { ... } { ... }, whereas jsonlint expects them to be separated by commas, as in { ... }, { ... }, { ... }, { ... }, { ... }. Could this be the reason? Commented Aug 28, 2022 at 0:44
  • That was indeed the problem. See posted solution Commented Aug 28, 2022 at 1:39
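
For anyone wondering why a standard parser rejects this layout, here is a minimal sketch using Python's built-in json module. The two-record string is a made-up stand-in for the real file, not its actual contents. A single JSON document may hold only one top-level value, so the parser stops after the first object and complains about what follows.

```python
import json

# Hypothetical stand-in for the file: two objects back to back, no separator.
concatenated = '{"name": "A"}{"name": "B"}'

try:
    json.loads(concatenated)
    error_message = None
except json.JSONDecodeError as exc:
    # Parsing succeeds for the first object, then fails on the leftover text.
    error_message = exc.msg

print(error_message)
```

Wrapping the objects in [ ... ] with commas between them, or putting one object per line and reading with lines=True, are the two usual ways to make such a file loadable.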

1 Answer

Placing a newline character between each pair of successive JSON objects solves the problem!

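import pandas as pd

# Retrieve JSON file from GitHub
!wget -O peopleData.json -q https://github.com/prithwis/parashar21/raw/main/data/peopleDataTest5.json
!cat peopleData.json
# Confirm that the objects are concatenated with nothing between them
!grep '}{' peopleData.json
# Insert a newline between adjacent objects, editing the file in place
!sed -i 's/}{/}\n{/g' peopleData.json
!cat peopleData.json
# Each line now holds one JSON object, so read the file line by line
data = pd.read_json('./peopleData.json', lines=True)
data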

I inserted a \n between each }{ pair using sed. Before this, the file was one continuous line; now it has 5 separate lines, so the read_json() function works with the lines=True option.
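
If you would rather not edit the file, a pure-Python alternative is to decode the objects one after another with json.JSONDecoder.raw_decode from the standard library. This is only a sketch: the helper name read_concatenated_json and the sample string are made up for illustration and are not the real file contents. Unlike a text substitution on }{, it does not get confused when those two characters happen to appear inside a string value.

```python
import json

def read_concatenated_json(text):
    """Parse JSON objects that are concatenated with or without whitespace."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # Returns the decoded object and the index where it ended.
        obj, pos = decoder.raw_decode(text, pos)
        records.append(obj)
    return records

# Hypothetical sample: three records, one of which contains "}{" in a value.
sample = '{"name": "A", "age": 30}{"name": "B", "age": 41} {"name": "C}{", "age": 25}'
records = read_concatenated_json(sample)
print(len(records))
```

For the downloaded file you would read its text first, for example with open('peopleData.json').read(), and then build the DataFrame with pd.DataFrame(records).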
