I have tens of thousands of rows of JSON snippets like this in a pandas Series, df["json"]:

[{
    'IDs': [{
        'lotId': '1',
        'Id': '123456'
    }],
    'date': '2009-04-17',
    'bidsCount': 2,
}, {
    'IDs': [{
        'lotId': '2',
        'Id': '123456'
    }],
    'date': '2009-04-17',
    'bidsCount': 4,
}, {
    'IDs': [{
         'lotId': '3',
         'Id': '123456'
    }],
    'date': '2009-04-17',
    'bidsCount': 8,
}]

Sample of the original file:

{"type": "OPEN","title": "rainbow","json": [{"IDs": [{"lotId": "1","Id": "123456"}],"date": "2009-04-17","bidsCount": 2}, {"IDs": [{"lotId": "2","Id": "123456"}],"date": "2009-04-17","bidsCount": 4}, {"IDs": [{"lotId": "3","Id": "123456"}],"date": "2009-04-17","bidsCount": 8}]}
{"type": "CLOSED","title": "clouds","json": [{"IDs": [{"lotId": "1","Id": "23345"}],"date": "2009-05-17","bidsCount": 2}, {"IDs": [{"lotId": "2","Id": "23345"}],"date": "2009-05-17","bidsCount": 4}, {"IDs": [{"lotId": "3","Id": "23345"}],"date": "2009-05-17","bidsCount": 8}]}

which I load with

import pandas as pd
df = pd.read_json("file.json", lines=True)

I am trying to make them into a data frame, something like

Id      lotId      bidsCount    date
123456  1          2            2009-04-17
123456  2          4            2009-04-17
123456  3          8            2009-04-17

by using

json_normalize(df["json"])

However I get

AttributeError: 'list' object has no attribute 'values'

I guess the JSON snippet is seen as a list, but I cannot figure out how to make it work otherwise. Help appreciated!

  • How do you create df first? Commented Jul 26, 2017 at 11:16
  • Please paste your data frame's head here. Is your jsons column a string? Commented Jul 26, 2017 at 11:28
  • zufanka, first of all, as the documentation says, df['json'] should be a dict or a list of dicts. Then you could do result = json_normalize(data, 'IDs', ['date', 'bidsCount']) to get your desired result. I did the same in my answer; don't know why people like to downvote. Hope this helps. Commented Jul 26, 2017 at 11:54
  • I create the df from an enormous json file through pd.read_json("file.json", lines=True). The json column is one of the file's nested parts, not a string. As the data is confidential, I can try to recreate the file if that would help. Commented Jul 26, 2017 at 11:55
  • zufanka, yes. Just do type(df['json']) to make sure that it's a dict or a list of dicts, which is what json_normalize() works with. If you could tell how you're creating df['json'], that would help. You don't need to recreate the whole data; just a sample would be great. Commented Jul 26, 2017 at 11:59

1 Answer

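I think each cell of your df['json'] holds a list of dicts, so json_normalize has to be applied to each row rather than to the whole Series. You can loop over the rows and concatenate the per-row results into one big dataframe, i.e.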

Data:

{"type": "OPEN","title": "rainbow","json": [{"IDs": [{"lotId": "1","Id": "123456"}],"date": "2009-04-17","bidsCount": 2}, {"IDs": [{"lotId": "2","Id": "123456"}],"date": "2009-04-17","bidsCount": 4}, {"IDs": [{"lotId": "3","Id": "123456"}],"date": "2009-04-17","bidsCount": 8}]}
{"type": "CLOSED","title": "clouds","json": [{"IDs": [{"lotId": "1","Id": "23345"}],"date": "2009-05-17","bidsCount": 2}, {"IDs": [{"lotId": "2","Id": "23345"}],"date": "2009-05-17","bidsCount": 4}, {"IDs": [{"lotId": "3","Id": "23345"}],"date": "2009-05-17","bidsCount": 8}]}

import pandas as pd

df = pd.read_json("file.json", lines=True)

DataFrame:

# pd.json_normalize (pandas >= 1.0) already returns a DataFrame, so no extra wrapper is needed
new_df = pd.concat([pd.json_normalize(x) for x in df['json']], ignore_index=True)

Output:

                                IDs  bidsCount        date
0  [{'Id': '123456', 'lotId': '1'}]          2  2009-04-17
1  [{'Id': '123456', 'lotId': '2'}]          4  2009-04-17
2  [{'Id': '123456', 'lotId': '3'}]          8  2009-04-17
3   [{'Id': '23345', 'lotId': '1'}]          2  2009-05-17
4   [{'Id': '23345', 'lotId': '2'}]          4  2009-05-17
5   [{'Id': '23345', 'lotId': '3'}]          8  2009-05-17

If you want the keys of IDs as columns, you can use the following (it assumes each IDs list holds exactly one dict):

new_df['lotId'] = [x[0]['lotId'] for x in new_df['IDs']]
new_df['IDs'] = [x[0]['Id'] for x in new_df['IDs']]

Output:

      IDs  bidsCount        date lotId
0  123456          2  2009-04-17     1
1  123456          4  2009-04-17     2
2  123456          8  2009-04-17     3
3   23345          2  2009-05-17     1
4   23345          4  2009-05-17     2
5   23345          8  2009-05-17     3
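
As an alternative to pulling the keys out by hand, the record_path/meta form of json_normalize mentioned in the comments can expand IDs directly. This is only a minimal sketch, assuming pandas 1.0 or later (where the function is available as pd.json_normalize); the records list below is a hypothetical stand-in for a single cell of df['json'], built from the sample data.

```python
import pandas as pd

# Hypothetical stand-in for one cell of df['json']: a list of dicts.
records = [
    {"IDs": [{"lotId": "1", "Id": "123456"}], "date": "2009-04-17", "bidsCount": 2},
    {"IDs": [{"lotId": "2", "Id": "123456"}], "date": "2009-04-17", "bidsCount": 4},
    {"IDs": [{"lotId": "3", "Id": "123456"}], "date": "2009-04-17", "bidsCount": 8},
]

# record_path turns every entry of the nested 'IDs' list into its own row;
# meta copies the parent-level fields onto each of those rows.
flat = pd.json_normalize(records, record_path="IDs", meta=["date", "bidsCount"])

# Reorder the columns to match the layout asked for in the question.
flat = flat[["Id", "lotId", "bidsCount", "date"]]
print(flat)
```

Unlike indexing with x[0], this form should also cope with IDs lists that hold more than one dict, since each entry becomes a separate row.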
3 Comments

does exactly what I need, many thanks! Just needed to add df['json'].dropna() as some of the data is missing.
Glad it helped!
Any more efficient approaches to this?
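
One possible answer to the efficiency question, offered as a sketch rather than a benchmarked result: chain all the per-row lists into one flat list and call json_normalize once, instead of building a small dataframe per row and concatenating them. The json_col Series below is a hypothetical stand-in for df['json'], including a missing row to show where the dropna() from the first comment fits; pd.json_normalize assumes pandas 1.0 or later.

```python
from itertools import chain

import pandas as pd

# Hypothetical stand-in for df['json']: each cell is a list of dicts, or missing.
json_col = pd.Series([
    [{"IDs": [{"lotId": "1", "Id": "123456"}], "date": "2009-04-17", "bidsCount": 2},
     {"IDs": [{"lotId": "2", "Id": "123456"}], "date": "2009-04-17", "bidsCount": 4}],
    None,  # a row with no nested data
    [{"IDs": [{"lotId": "1", "Id": "23345"}], "date": "2009-05-17", "bidsCount": 2}],
])

# Drop missing cells, then chain every per-row list into one flat list of dicts.
all_records = list(chain.from_iterable(json_col.dropna()))

# A single json_normalize call replaces the per-row calls and the concat.
new_df = pd.json_normalize(all_records, record_path="IDs", meta=["date", "bidsCount"])
print(new_df)
```

Whether this is noticeably faster on tens of thousands of rows would need to be measured on the real data.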
