Error in loading data in JSON file to Python Pandas dataframe

Question

I have a JSON file with multiple 'records' that I can easily load into a MongoDB database, and I can then extract certain records from MongoDB into a pandas DataFrame. This works fine. However, I want to avoid the MongoDB route and load all the records in the JSON file directly into a pandas DataFrame. I thought this would be easy, but it is not working at all.

This is what I have done

import pandas as pd
!wget -O peopleData.json -q https://github.com/prithwis/parashar21/raw/main/data/peopleDataTest5.json
data = pd.read_json('/content/peopleData.json')
#data = pd.read_json('/content/peopleData.json', lines=True)

This is throwing errors. I am using Google Colab and the notebook is available at this link.

I have seen quite a few other questions on Stack Overflow that seem to address the same problem, but none of the answers work in my case. I would be grateful if someone could help me fix this.

You can convert your JSON file to a DataFrame by reading it directly from the URL with data = pd.read_json(URL, lines=True), but your JSON file does not seem to be valid: jsonlint reports "Error: Parse error on line 193".
– Timeless, Commented Aug 27, 2022 at 11:51
Thank you for your comment, but if you look at the notebook you will see that the file does exist, gets downloaded into the Colab VM with wget, and (though not shown here) is then used correctly for other operations.
– Calcutta, Commented Aug 27, 2022 at 12:13
I looked closely at the output of jsonlint and observed the following: I have five 'records' that are separated by blanks, that is, { ... } { ... } { ... } { ... } { ... }, whereas jsonlint expects them to be separated by commas, as in { ... }, { ... }, { ... }, { ... }, { ... }. Could this be the reason?
– Calcutta, Commented Aug 28, 2022 at 0:44
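
The diagnosis in the last comment can be checked with the standard library alone. The following is a minimal sketch, assuming a made-up two-record string (`concatenated`) rather than the actual file: a strict JSON parser rejects several objects placed back to back, while the same objects written one per line can be parsed line by line, which is the layout that `lines=True` expects.

```python
import json

# Hypothetical two-record sample; the real file holds five larger records.
concatenated = '{"id": 1}{"id": 2}'

try:
    json.loads(concatenated)
    error_message = None
except json.JSONDecodeError as exc:
    # A single JSON document may hold only one top-level value,
    # so the second object is rejected.
    error_message = exc.msg

print(error_message)

# The same records, one per line, parse cleanly when read line by line.
json_lines = '{"id": 1}\n{"id": 2}'
parsed = [json.loads(line) for line in json_lines.splitlines()]
print(parsed)
```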

Calcutta · Accepted Answer · 2022-08-28 01:24:59Z


Placing a newline character between successive JSON objects solves the problem!

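import pandas as pd

# Retrieve the JSON file from GitHub
!wget -O peopleData.json -q https://github.com/prithwis/parashar21/raw/main/data/peopleDataTest5.json
# Inspect the file and confirm that successive objects are joined as }{
!cat peopleData.json
!grep '}{' peopleData.json
# Insert a newline between successive objects (GNU sed expands \n in the replacement)
!sed -i 's/}{/}\n{/g' peopleData.json
!cat peopleData.json
# Each line now holds one JSON object, so lines=True can parse the file
data = pd.read_json('./peopleData.json', lines=True)
data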

I inserted a \n between each }{ pair using sed. Before this, the file was one continuous line; now it has five separate lines, one JSON object per line, so read_json() works with the lines=True option.
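
One caveat with the sed approach is that it would also split a }{ sequence that happens to occur inside a string value. As a possible alternative, here is a minimal sketch that does the splitting in Python with `json.JSONDecoder.raw_decode`, which parses one value at a time and reports where it stopped. The helper name `split_concatenated_json` and the `sample` string are hypothetical and used only for illustration; they are not part of the original notebook.

```python
import json


def split_concatenated_json(text):
    """Return the top-level JSON values found back to back in text."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    length = len(text)
    while pos < length:
        # raw_decode does not skip leading whitespace, so step over any
        # blanks or newlines between records first.
        while pos < length and text[pos].isspace():
            pos += 1
        if pos >= length:
            break
        # Parse one value starting at pos; the second item is the index
        # just past the end of that value.
        record, pos = decoder.raw_decode(text, pos)
        records.append(record)
    return records


# Hypothetical sample standing in for the downloaded file's contents.
# Note the }{ inside a string value and the mixed separators.
sample = '{"name": "A", "note": "x}{y"}{"name": "B", "note": "z"} {"name": "C", "note": ""}'
records = split_concatenated_json(sample)
print(len(records))

# With pandas available, the list could then be turned into a DataFrame:
# import pandas as pd
# data = pd.DataFrame(records)
```

Because the parser tracks string boundaries itself, this handles records separated by nothing, by blanks, or by newlines alike, and it leaves the downloaded file unmodified.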

