
I have a file containing one JSON entry per line:

{"_id":"5d42af1fb42842aa680cdba8","data_type":"8a6f03a1-4594-4133-9ba9-35e8eb83b62b","version":"1b1ec5d7-931a-4d60-b892-1db20ce2d98e","data":[{"id":"5d42af1f170e8d210fe935af","name":"Harriett Floyd"},{"id":"5d42af1fa92b30f9edbd4fb7","name":"Serrano Stein"},{"id":"5d42af1f2c1a804f5ac64491","name":"Denise Lopez"}]}
{"_id":"5d42af1fe2969c2e4064b522","data_type":"e627abb0-2b89-49af-8f26-2554dc655755","version":"4c625773-617b-460b-ac7c-8ddfb19058c8","data":[{"id":"5d42af1f0c91b1b5e484dc02","name":"Sears Gray"},{"id":"5d42af1f880d828b2e6d0c9f","name":"Carmen Britt"},{"id":"5d42af1fecdf9b333ce210a5","name":"Laura Haynes"}]}
{"_id":"5d42af1f932313d233121f52","data_type":"0189ecbd-ec19-4675-adab-0efaa4b3980e","version":"b0161b41-0f74-4040-94c7-2dc65916eb2a","data":[{"id":"5d42af1f07c1413d3cee996b","name":"Espinoza Miranda"},{"id":"5d42af1f4de7227a20790512","name":"Gallegos Everett"},{"id":"5d42af1fd65727bdeefebbc2","name":"Kristy Gates"}]}
{"_id":"5d42af1f41316fd69bb8eb65","data_type":"c69aa41d-bd7b-49b4-b147-a06a03ee14d1","version":"854427a3-1ad0-4f48-8682-197bec45c0fd","data":[{"id":"5d42af1f51417661828db0b6","name":"Morgan Osborne"},{"id":"5d42af1f8f346e78685f45d3","name":"Colleen Bray"},{"id":"5d42af1f80cd622be5c8491b","name":"Shana Henson"}]}
{"_id":"5d42af1f6f6ebc59ed4d3a04","data_type":"2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c","version":"9ded1de4-6b01-4fbf-b150-559f7a638544","data":[{"id":"5d42af1f8c1eb70073dae767","name":"Maricela Austin"},{"id":"5d42af1f640fc89271413622","name":"Tabatha Silva"},{"id":"5d42af1f96c309104b2b8127","name":"Gail Mendez"}]}

One entry in a prettier format:

{
  "_id": "5d42af1fb42842aa680cdba8",
  "data_type": "8a6f03a1-4594-4133-9ba9-35e8eb83b62b",
  "version": "1b1ec5d7-931a-4d60-b892-1db20ce2d98e",
  "data": [
    {
      "id": "5d42af1f170e8d210fe935af",
      "name": "Harriett Floyd"
    },
    {
      "id": "5d42af1fa92b30f9edbd4fb7",
      "name": "Serrano Stein"
    },
    {
      "id": "5d42af1f2c1a804f5ac64491",
      "name": "Denise Lopez"
    }
  ]
}

So each JSON entry contains several attributes, and the 'data' attribute contains a nested JSON array. What I would like to do, using only Pandas, is store all the JSON entries in a DataFrame with one row per element of 'data'.

I tried this:

df_json = pd.read_json(path_json_file, lines=True)  # lines=True: one JSON object per line

and obtained this:

                         _id                               data_type    version      data
0   5d42af1fb42842aa680cdba8    8a6f03a1-4594-4133-9ba9-35e8eb83b62b    1b1ec5d7-931a-4d60-b892-1db20ce2d98e    [{'id': '5d42af1f170e8d210fe935af', 'name': 'H...
1   5d42af1fe2969c2e4064b522    e627abb0-2b89-49af-8f26-2554dc655755    4c625773-617b-460b-ac7c-8ddfb19058c8    [{'id': '5d42af1f0c91b1b5e484dc02', 'name': 'S...
2   5d42af1f932313d233121f52    0189ecbd-ec19-4675-adab-0efaa4b3980e    b0161b41-0f74-4040-94c7-2dc65916eb2a    [{'id': '5d42af1f07c1413d3cee996b', 'name': 'E...
3   5d42af1f41316fd69bb8eb65    c69aa41d-bd7b-49b4-b147-a06a03ee14d1    854427a3-1ad0-4f48-8682-197bec45c0fd    [{'id': '5d42af1f51417661828db0b6', 'name': 'M...
4   5d42af1f6f6ebc59ed4d3a04    2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c    9ded1de4-6b01-4fbf-b150-559f7a638544    [{'id': '5d42af1f8c1eb70073dae767', 'name': 'M...
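
As a quick check (a sketch; this is what read_json with lines=True typically produces), the cells of the 'data' column are already Python lists of dicts rather than raw strings:

print(type(df_json.loc[0, "data"]))  # <class 'list'>
print(df_json.loc[0, "data"][0])     # {'id': '5d42af1f170e8d210fe935af', 'name': 'Harriett Floyd'}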

So the 'data' column contains an array of JSON objects, but what I want is one row for each element of that array.

Then I learned about the json_normalize function in Pandas and did the following:

1) I stored all the JSON entries in one list:

import pandas as pd
import os
import json

json_array = []
with open(path_json_file, 'r') as f:
    for line in f:
        json_array.append(json.loads(line))

2) I stored the keys of the JSON, except 'data', to use as the metadata columns:

key_list = list(json_array[0].keys())
key_list.remove("data")

3) I used the json_normalize function:

pd.io.json.json_normalize(json_array, "data", key_list, errors="ignore", record_prefix="record_data_")
# note: in pandas >= 1.0 this function is exposed as pd.json_normalize

Output:

 record_data_id record_data_name    _id data_type   version
0   5d42af1f170e8d210fe935af    Harriett Floyd  5d42af1fb42842aa680cdba8    8a6f03a1-4594-4133-9ba9-35e8eb83b62b    1b1ec5d7-931a-4d60-b892-1db20ce2d98e
1   5d42af1fa92b30f9edbd4fb7    Serrano Stein   5d42af1fb42842aa680cdba8    8a6f03a1-4594-4133-9ba9-35e8eb83b62b    1b1ec5d7-931a-4d60-b892-1db20ce2d98e
2   5d42af1f2c1a804f5ac64491    Denise Lopez    5d42af1fb42842aa680cdba8    8a6f03a1-4594-4133-9ba9-35e8eb83b62b    1b1ec5d7-931a-4d60-b892-1db20ce2d98e
3   5d42af1f0c91b1b5e484dc02    Sears Gray  5d42af1fe2969c2e4064b522    e627abb0-2b89-49af-8f26-2554dc655755    4c625773-617b-460b-ac7c-8ddfb19058c8
4   5d42af1f880d828b2e6d0c9f    Carmen Britt    5d42af1fe2969c2e4064b522    e627abb0-2b89-49af-8f26-2554dc655755    4c625773-617b-460b-ac7c-8ddfb19058c8
5   5d42af1fecdf9b333ce210a5    Laura Haynes    5d42af1fe2969c2e4064b522    e627abb0-2b89-49af-8f26-2554dc655755    4c625773-617b-460b-ac7c-8ddfb19058c8
6   5d42af1f07c1413d3cee996b    Espinoza Miranda    5d42af1f932313d233121f52    0189ecbd-ec19-4675-adab-0efaa4b3980e    b0161b41-0f74-4040-94c7-2dc65916eb2a
...

This output is exactly what I want, but is there a trick to do this using only Pandas, without having to parse the file myself?
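
For reference, one pandas-only route is to let read_json do the file parsing and then hand the rows back to json_normalize via to_dict('records'). This is a sketch: it assumes pandas >= 1.0, where json_normalize is exposed as pd.json_normalize (on older versions it lives at pd.io.json.json_normalize):

import pandas as pd

df = pd.read_json(path_json_file, lines=True)
meta_cols = [c for c in df.columns if c != "data"]  # _id, data_type, version
res = pd.json_normalize(df.to_dict("records"), record_path="data",
                        meta=meta_cols, record_prefix="record_data_")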

Comments:

  • What is the expected output?
  • The output I got by using json_normalize. As I said in my last sentence, I got the desired output, but I want to know if there is something easier using only Pandas. I tried to rephrase my question so it is clearer :)

1 Answer


A short approach based on DataFrame concatenation:

import pandas as pd

df_json = pd.read_json(path_json_file, lines=True)
# For each row, build a DataFrame from the 'data' list and attach the remaining
# scalar columns (_id, data_type, version) via assign
dfs = df_json.apply(lambda r: pd.DataFrame(r['data']).assign(**r.drop('data')), axis=1)
# Stack the per-row DataFrames into a single result
res = pd.concat(dfs.tolist(), ignore_index=True)
print(res.to_string())

The output:

                          id              name                       _id                             data_type                               version
0   5d42af1f170e8d210fe935af    Harriett Floyd  5d42af1fb42842aa680cdba8  8a6f03a1-4594-4133-9ba9-35e8eb83b62b  1b1ec5d7-931a-4d60-b892-1db20ce2d98e
1   5d42af1fa92b30f9edbd4fb7     Serrano Stein  5d42af1fb42842aa680cdba8  8a6f03a1-4594-4133-9ba9-35e8eb83b62b  1b1ec5d7-931a-4d60-b892-1db20ce2d98e
2   5d42af1f2c1a804f5ac64491      Denise Lopez  5d42af1fb42842aa680cdba8  8a6f03a1-4594-4133-9ba9-35e8eb83b62b  1b1ec5d7-931a-4d60-b892-1db20ce2d98e
3   5d42af1f0c91b1b5e484dc02        Sears Gray  5d42af1fe2969c2e4064b522  e627abb0-2b89-49af-8f26-2554dc655755  4c625773-617b-460b-ac7c-8ddfb19058c8
4   5d42af1f880d828b2e6d0c9f      Carmen Britt  5d42af1fe2969c2e4064b522  e627abb0-2b89-49af-8f26-2554dc655755  4c625773-617b-460b-ac7c-8ddfb19058c8
5   5d42af1fecdf9b333ce210a5      Laura Haynes  5d42af1fe2969c2e4064b522  e627abb0-2b89-49af-8f26-2554dc655755  4c625773-617b-460b-ac7c-8ddfb19058c8
6   5d42af1f07c1413d3cee996b  Espinoza Miranda  5d42af1f932313d233121f52  0189ecbd-ec19-4675-adab-0efaa4b3980e  b0161b41-0f74-4040-94c7-2dc65916eb2a
7   5d42af1f4de7227a20790512  Gallegos Everett  5d42af1f932313d233121f52  0189ecbd-ec19-4675-adab-0efaa4b3980e  b0161b41-0f74-4040-94c7-2dc65916eb2a
8   5d42af1fd65727bdeefebbc2      Kristy Gates  5d42af1f932313d233121f52  0189ecbd-ec19-4675-adab-0efaa4b3980e  b0161b41-0f74-4040-94c7-2dc65916eb2a
9   5d42af1f51417661828db0b6    Morgan Osborne  5d42af1f41316fd69bb8eb65  c69aa41d-bd7b-49b4-b147-a06a03ee14d1  854427a3-1ad0-4f48-8682-197bec45c0fd
10  5d42af1f8f346e78685f45d3      Colleen Bray  5d42af1f41316fd69bb8eb65  c69aa41d-bd7b-49b4-b147-a06a03ee14d1  854427a3-1ad0-4f48-8682-197bec45c0fd
11  5d42af1f80cd622be5c8491b      Shana Henson  5d42af1f41316fd69bb8eb65  c69aa41d-bd7b-49b4-b147-a06a03ee14d1  854427a3-1ad0-4f48-8682-197bec45c0fd
12  5d42af1f8c1eb70073dae767   Maricela Austin  5d42af1f6f6ebc59ed4d3a04  2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c  9ded1de4-6b01-4fbf-b150-559f7a638544
13  5d42af1f640fc89271413622     Tabatha Silva  5d42af1f6f6ebc59ed4d3a04  2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c  9ded1de4-6b01-4fbf-b150-559f7a638544
14  5d42af1f96c309104b2b8127       Gail Mendez  5d42af1f6f6ebc59ed4d3a04  2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c  9ded1de4-6b01-4fbf-b150-559f7a638544
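
A hedged alternative for newer pandas (a sketch; DataFrame.explode requires pandas >= 0.25, and every element of 'data' is assumed to be a flat dict with the same keys): explode the 'data' column so each list element gets its own row, then expand the dicts into columns:

import pandas as pd

df = pd.read_json(path_json_file, lines=True)
df = df.explode("data").reset_index(drop=True)    # one row per element of 'data'
expanded = pd.DataFrame(df.pop("data").tolist())  # dicts -> 'id'/'name' columns
res = pd.concat([expanded, df], axis=1)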

Comments:

  • Thanks, perfect. I knew something should be doable using Pandas but couldn't find it! Do you think your approach is heavier in memory usage during computation than mine? I have around 500,000 JSON entries spread over separate files, so I might need to think more about time and memory usage than about the number of lines of code :)
  • @SmileyProd, welcome. I would say: analyze how many files should be processed at once (approximately how many records per file), try launching the processing with the current approach, and measure the timings. If it leads to a performance hit, feel free to create a new question for that; you may then ping me with the new link/question and I'll try to help.
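
Regarding the memory concern above, a sketch of one way to bound the working set: read_json accepts a chunksize together with lines=True and then returns an iterator of DataFrames, so only one chunk of raw rows needs flattening at a time (the chunk size of 10000 here is an arbitrary assumption to tune):

import pandas as pd

pieces = []
reader = pd.read_json(path_json_file, lines=True, chunksize=10000)
for chunk in reader:
    # Flatten each chunk with the same per-row technique as above
    flat = chunk.apply(lambda r: pd.DataFrame(r['data']).assign(**r.drop('data')), axis=1)
    pieces.append(pd.concat(flat.tolist(), ignore_index=True))
res = pd.concat(pieces, ignore_index=True)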
