
I have a file containing one JSON entry per line:

{"_id":"5d42af1fb42842aa680cdba8","data_type":"8a6f03a1-4594-4133-9ba9-35e8eb83b62b","version":"1b1ec5d7-931a-4d60-b892-1db20ce2d98e","data":[{"id":"5d42af1f170e8d210fe935af","name":"Harriett Floyd"},{"id":"5d42af1fa92b30f9edbd4fb7","name":"Serrano Stein"},{"id":"5d42af1f2c1a804f5ac64491","name":"Denise Lopez"}]}
{"_id":"5d42af1fe2969c2e4064b522","data_type":"e627abb0-2b89-49af-8f26-2554dc655755","version":"4c625773-617b-460b-ac7c-8ddfb19058c8","data":[{"id":"5d42af1f0c91b1b5e484dc02","name":"Sears Gray"},{"id":"5d42af1f880d828b2e6d0c9f","name":"Carmen Britt"},{"id":"5d42af1fecdf9b333ce210a5","name":"Laura Haynes"}]}
{"_id":"5d42af1f932313d233121f52","data_type":"0189ecbd-ec19-4675-adab-0efaa4b3980e","version":"b0161b41-0f74-4040-94c7-2dc65916eb2a","data":[{"id":"5d42af1f07c1413d3cee996b","name":"Espinoza Miranda"},{"id":"5d42af1f4de7227a20790512","name":"Gallegos Everett"},{"id":"5d42af1fd65727bdeefebbc2","name":"Kristy Gates"}]}
{"_id":"5d42af1f41316fd69bb8eb65","data_type":"c69aa41d-bd7b-49b4-b147-a06a03ee14d1","version":"854427a3-1ad0-4f48-8682-197bec45c0fd","data":[{"id":"5d42af1f51417661828db0b6","name":"Morgan Osborne"},{"id":"5d42af1f8f346e78685f45d3","name":"Colleen Bray"},{"id":"5d42af1f80cd622be5c8491b","name":"Shana Henson"}]}
{"_id":"5d42af1f6f6ebc59ed4d3a04","data_type":"2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c","version":"9ded1de4-6b01-4fbf-b150-559f7a638544","data":[{"id":"5d42af1f8c1eb70073dae767","name":"Maricela Austin"},{"id":"5d42af1f640fc89271413622","name":"Tabatha Silva"},{"id":"5d42af1f96c309104b2b8127","name":"Gail Mendez"}]}

One entry in a prettier format:

{
  "_id": "5d42af1fb42842aa680cdba8",
  "data_type": "8a6f03a1-4594-4133-9ba9-35e8eb83b62b",
  "version": "1b1ec5d7-931a-4d60-b892-1db20ce2d98e",
  "data": [
    {
      "id": "5d42af1f170e8d210fe935af",
      "name": "Harriett Floyd"
    },
    {
      "id": "5d42af1fa92b30f9edbd4fb7",
      "name": "Serrano Stein"
    },
    {
      "id": "5d42af1f2c1a804f5ac64491",
      "name": "Denise Lopez"
    }
  ]
}

So each JSON entry contains several attributes, and the 'data' attribute contains a nested JSON array. What I would like to do, using only Pandas, is store all the JSON entries in a DataFrame with one row per element of 'data'.

I tried this:

df_json = pd.read_json(path_json_file, lines=True)  # lines=True: one JSON object per line

and obtained this:

                         _id                               data_type    version      data
0   5d42af1fb42842aa680cdba8    8a6f03a1-4594-4133-9ba9-35e8eb83b62b    1b1ec5d7-931a-4d60-b892-1db20ce2d98e    [{'id': '5d42af1f170e8d210fe935af', 'name': 'H...
1   5d42af1fe2969c2e4064b522    e627abb0-2b89-49af-8f26-2554dc655755    4c625773-617b-460b-ac7c-8ddfb19058c8    [{'id': '5d42af1f0c91b1b5e484dc02', 'name': 'S...
2   5d42af1f932313d233121f52    0189ecbd-ec19-4675-adab-0efaa4b3980e    b0161b41-0f74-4040-94c7-2dc65916eb2a    [{'id': '5d42af1f07c1413d3cee996b', 'name': 'E...
3   5d42af1f41316fd69bb8eb65    c69aa41d-bd7b-49b4-b147-a06a03ee14d1    854427a3-1ad0-4f48-8682-197bec45c0fd    [{'id': '5d42af1f51417661828db0b6', 'name': 'M...
4   5d42af1f6f6ebc59ed4d3a04    2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c    9ded1de4-6b01-4fbf-b150-559f7a638544    [{'id': '5d42af1f8c1eb70073dae767', 'name': 'M...
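
As a quick check (a sketch; this is what read_json with lines=True typically produces), the cells of the 'data' column are already Python lists of dicts rather than raw strings:

print(type(df_json.loc[0, "data"]))  # <class 'list'>
print(df_json.loc[0, "data"][0])     # {'id': '5d42af1f170e8d210fe935af', 'name': 'Harriett Floyd'}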

So the 'data' column contains an array of JSON objects, but what I want is one row for each element of that array.

Then I learned about the json_normalize function in Pandas and did the following:

1) I stored all the JSON entries in one list:

import pandas as pd
import os
import json

json_array = []
with open(path_json_file, 'r') as f:
    for line in f:
        json_array.append(json.loads(line))

2) I stored the keys of the JSON, except 'data', to use as the metadata columns:

key_list = list(json_array[0].keys())
key_list.remove("data")

3) I used the json_normalize function:

pd.io.json.json_normalize(json_array, "data", key_list, errors="ignore", record_prefix="record_data_")
# note: in pandas >= 1.0 this function is exposed as pd.json_normalize

Output:

 record_data_id record_data_name    _id data_type   version
0   5d42af1f170e8d210fe935af    Harriett Floyd  5d42af1fb42842aa680cdba8    8a6f03a1-4594-4133-9ba9-35e8eb83b62b    1b1ec5d7-931a-4d60-b892-1db20ce2d98e
1   5d42af1fa92b30f9edbd4fb7    Serrano Stein   5d42af1fb42842aa680cdba8    8a6f03a1-4594-4133-9ba9-35e8eb83b62b    1b1ec5d7-931a-4d60-b892-1db20ce2d98e
2   5d42af1f2c1a804f5ac64491    Denise Lopez    5d42af1fb42842aa680cdba8    8a6f03a1-4594-4133-9ba9-35e8eb83b62b    1b1ec5d7-931a-4d60-b892-1db20ce2d98e
3   5d42af1f0c91b1b5e484dc02    Sears Gray  5d42af1fe2969c2e4064b522    e627abb0-2b89-49af-8f26-2554dc655755    4c625773-617b-460b-ac7c-8ddfb19058c8
4   5d42af1f880d828b2e6d0c9f    Carmen Britt    5d42af1fe2969c2e4064b522    e627abb0-2b89-49af-8f26-2554dc655755    4c625773-617b-460b-ac7c-8ddfb19058c8
5   5d42af1fecdf9b333ce210a5    Laura Haynes    5d42af1fe2969c2e4064b522    e627abb0-2b89-49af-8f26-2554dc655755    4c625773-617b-460b-ac7c-8ddfb19058c8
6   5d42af1f07c1413d3cee996b    Espinoza Miranda    5d42af1f932313d233121f52    0189ecbd-ec19-4675-adab-0efaa4b3980e    b0161b41-0f74-4040-94c7-2dc65916eb2a
...

This output is exactly what I want, but is there a trick to do this using only Pandas, without having to parse the file myself?
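
For reference, one pandas-only route is to let read_json do the file parsing and then hand the rows back to json_normalize via to_dict('records'). This is a sketch: it assumes pandas >= 1.0, where json_normalize is exposed as pd.json_normalize (on older versions it lives at pd.io.json.json_normalize):

import pandas as pd

df = pd.read_json(path_json_file, lines=True)
meta_cols = [c for c in df.columns if c != "data"]  # _id, data_type, version
res = pd.json_normalize(df.to_dict("records"), record_path="data",
                        meta=meta_cols, record_prefix="record_data_")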

Comments:

  • What is the expected output?
  • The output I got by using json_normalize. As I said in my last sentence, I got the desired output, but I want to know if there is something easier using only Pandas. I tried to rephrase my question so it is clearer :)

1 Answer


A short approach based on DataFrame concatenation:

import pandas as pd

df_json = pd.read_json(path_json_file, lines=True)
# For each row, build a DataFrame from the 'data' list and attach the remaining
# scalar columns (_id, data_type, version) via assign
dfs = df_json.apply(lambda r: pd.DataFrame(r['data']).assign(**r.drop('data')), axis=1)
# Stack the per-row DataFrames into a single result
res = pd.concat(dfs.tolist(), ignore_index=True)
print(res.to_string())

The output:

                          id              name                       _id                             data_type                               version
0   5d42af1f170e8d210fe935af    Harriett Floyd  5d42af1fb42842aa680cdba8  8a6f03a1-4594-4133-9ba9-35e8eb83b62b  1b1ec5d7-931a-4d60-b892-1db20ce2d98e
1   5d42af1fa92b30f9edbd4fb7     Serrano Stein  5d42af1fb42842aa680cdba8  8a6f03a1-4594-4133-9ba9-35e8eb83b62b  1b1ec5d7-931a-4d60-b892-1db20ce2d98e
2   5d42af1f2c1a804f5ac64491      Denise Lopez  5d42af1fb42842aa680cdba8  8a6f03a1-4594-4133-9ba9-35e8eb83b62b  1b1ec5d7-931a-4d60-b892-1db20ce2d98e
3   5d42af1f0c91b1b5e484dc02        Sears Gray  5d42af1fe2969c2e4064b522  e627abb0-2b89-49af-8f26-2554dc655755  4c625773-617b-460b-ac7c-8ddfb19058c8
4   5d42af1f880d828b2e6d0c9f      Carmen Britt  5d42af1fe2969c2e4064b522  e627abb0-2b89-49af-8f26-2554dc655755  4c625773-617b-460b-ac7c-8ddfb19058c8
5   5d42af1fecdf9b333ce210a5      Laura Haynes  5d42af1fe2969c2e4064b522  e627abb0-2b89-49af-8f26-2554dc655755  4c625773-617b-460b-ac7c-8ddfb19058c8
6   5d42af1f07c1413d3cee996b  Espinoza Miranda  5d42af1f932313d233121f52  0189ecbd-ec19-4675-adab-0efaa4b3980e  b0161b41-0f74-4040-94c7-2dc65916eb2a
7   5d42af1f4de7227a20790512  Gallegos Everett  5d42af1f932313d233121f52  0189ecbd-ec19-4675-adab-0efaa4b3980e  b0161b41-0f74-4040-94c7-2dc65916eb2a
8   5d42af1fd65727bdeefebbc2      Kristy Gates  5d42af1f932313d233121f52  0189ecbd-ec19-4675-adab-0efaa4b3980e  b0161b41-0f74-4040-94c7-2dc65916eb2a
9   5d42af1f51417661828db0b6    Morgan Osborne  5d42af1f41316fd69bb8eb65  c69aa41d-bd7b-49b4-b147-a06a03ee14d1  854427a3-1ad0-4f48-8682-197bec45c0fd
10  5d42af1f8f346e78685f45d3      Colleen Bray  5d42af1f41316fd69bb8eb65  c69aa41d-bd7b-49b4-b147-a06a03ee14d1  854427a3-1ad0-4f48-8682-197bec45c0fd
11  5d42af1f80cd622be5c8491b      Shana Henson  5d42af1f41316fd69bb8eb65  c69aa41d-bd7b-49b4-b147-a06a03ee14d1  854427a3-1ad0-4f48-8682-197bec45c0fd
12  5d42af1f8c1eb70073dae767   Maricela Austin  5d42af1f6f6ebc59ed4d3a04  2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c  9ded1de4-6b01-4fbf-b150-559f7a638544
13  5d42af1f640fc89271413622     Tabatha Silva  5d42af1f6f6ebc59ed4d3a04  2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c  9ded1de4-6b01-4fbf-b150-559f7a638544
14  5d42af1f96c309104b2b8127       Gail Mendez  5d42af1f6f6ebc59ed4d3a04  2d3de9f1-0a0f-41b0-8c7c-9dfb6e909a1c  9ded1de4-6b01-4fbf-b150-559f7a638544
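
A hedged alternative for newer pandas (a sketch; DataFrame.explode requires pandas >= 0.25, and every element of 'data' is assumed to be a flat dict with the same keys): explode the 'data' column so each list element gets its own row, then expand the dicts into columns:

import pandas as pd

df = pd.read_json(path_json_file, lines=True)
df = df.explode("data").reset_index(drop=True)    # one row per element of 'data'
expanded = pd.DataFrame(df.pop("data").tolist())  # dicts -> 'id'/'name' columns
res = pd.concat([expanded, df], axis=1)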

Comments:

  • Thanks, perfect. I knew something should be doable using Pandas but couldn't find it! Do you think your approach is heavier in memory usage during computation than mine? I have around 500,000 JSON entries spread over separate files, so I might need to think more about time and memory usage than about the number of lines of code :)
  • @SmileyProd, welcome. I would say: analyze how many files should be processed at once (approximately how many records per file), try launching the processing with the current approach, and measure the timings. If it leads to a performance hit, feel free to create a new question for that; you may then ping me with the new link/question and I'll try to help.
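
Regarding the memory concern above, a sketch of one way to bound the working set: read_json accepts a chunksize together with lines=True and then returns an iterator of DataFrames, so only one chunk of raw rows needs flattening at a time (the chunk size of 10000 here is an arbitrary assumption to tune):

import pandas as pd

pieces = []
reader = pd.read_json(path_json_file, lines=True, chunksize=10000)
for chunk in reader:
    # Flatten each chunk with the same per-row technique as above
    flat = chunk.apply(lambda r: pd.DataFrame(r['data']).assign(**r.drop('data')), axis=1)
    pieces.append(pd.concat(flat.tolist(), ignore_index=True))
res = pd.concat(pieces, ignore_index=True)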
