
I have a data structure which is a very complex/nested JSON.

After reading other solutions on SO, I am comfortable with the fact that I may need to make multiple DFs and merge them together (e.g. json_normalize with multiple record paths).

{
  "topLevelIdentifier": "a86507",
  "activities": [
    {
      "activityId": "1f13d848",
      "taskList": [
        {
          "taskIdentifier": "c6ba-1012",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "3438"
        },
        {
          "taskIdentifier": "cc48-07bf",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "4b32"
        },
        {
          "taskIdentifier": "35b896",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "7bcca"
        }
      ],
      "sectionB": {
        "someCapacityA": {
          "totalVolume": {
            "value": 8415,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 1059.0,
            "unit": "GRAMS"
          }
        },
        "someCapacityB": {
          "totalVolume": {
            "value": 0.0,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 0.0,
            "unit": "GRAMS"
          }
        }
      },
 "sectionC": {....},
"sectionD": {....}, 
"sectonE":{...}
}]}

I can create a base for the merges by doing this:

import json
import pandas as pd

with open('my_file.json') as simulation_file:
    simulation_data = json.load(simulation_file)

df_base = pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])

Then normalizing the next part isn't too bad:

df_taskList = pd.json_normalize(simulation_data['activities'],
                                'taskList', ['activityId'])

It's normalizing the next section that has me stuck. Here's what I'm trying, but it throws an error:

df_sectionB = pd.json_normalize(simulation_data['activities'],
                                'activityId',
                                ['taskList', ['taskIdentifier', ['taskTime', ['unit', 'value']], 'taskLocationIdentifier']])

I'm basically trying to flatten out sectionB such that it has the activityId, so that I can merge all the pieces together.
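For context, once df_sectionB exists, the merge I'm aiming for is just a join on activityId. A toy sketch with made-up values shaped like my data (the df_sectionB frame and its column here are hypothetical):

```python
import pandas as pd

# Toy frames mimicking the shapes I expect (values are made up):
df_taskList = pd.DataFrame({
    "activityId": ["1f13d848", "1f13d848"],
    "taskIdentifier": ["c6ba-1012", "cc48-07bf"],
})
df_sectionB = pd.DataFrame({
    "activityId": ["1f13d848"],
    "someCapacityA.totalVolume.value": [8415],
})

# Each task row picks up its activity's sectionB columns
df_merged = df_taskList.merge(df_sectionB, on="activityId", how="left")
```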

If there's a simpler way to flatten out to the deepest layer in a complex format like this, I'm open to alternative approaches!

  • Perhaps you could df_base = pd.json_normalize(simulation_data, 'activities', 'topLevelIdentifier').explode('taskList') - which would allow you to then pd.json_normalize(df_base['taskList']).set_index(df_base.index) Commented Feb 21, 2023 at 0:46

2 Answers


You could add .explode("taskList") to the result of the first normalize.

You can then normalize the taskList column, copy the base index, and concat the results.

df_base = pd.json_normalize(
    simulation_data,
    record_path="activities",
    meta="topLevelIdentifier"
).explode("taskList")

df = pd.concat(
    [
        df_base.drop(columns="taskList"),
        pd.json_normalize(df_base["taskList"]).set_index(df_base.index)
    ],
    axis=1
)
>>> df.T
                                                   0           0           0
activityId                                  1f13d848    1f13d848    1f13d848
sectionC                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionD                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionE                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionB.someCapacityA.totalVolume.value        8415        8415        8415
sectionB.someCapacityA.totalVolume.unit        CU_CM       CU_CM       CU_CM
sectionB.someCapacityA.totalWeight.value      1059.0      1059.0      1059.0
sectionB.someCapacityA.totalWeight.unit        GRAMS       GRAMS       GRAMS
sectionB.someCapacityB.totalVolume.value         0.0         0.0         0.0
sectionB.someCapacityB.totalVolume.unit        CU_CM       CU_CM       CU_CM
sectionB.someCapacityB.totalWeight.value         0.0         0.0         0.0
sectionB.someCapacityB.totalWeight.unit        GRAMS       GRAMS       GRAMS
topLevelIdentifier                            a86507      a86507      a86507
taskIdentifier                             c6ba-1012   cc48-07bf      35b896
taskLocationIdentifier                          3438        4b32       7bcca
taskTime.value                                     0           0           0
taskTime.unit                                SECONDS     SECONDS     SECONDS
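The same explode → normalize → concat pattern repeats if any of the other sections (the elided sectionC/sectionD/sectionE) also turn out to hold lists. A minimal sketch with a hypothetical list-valued column (the sectionC contents here are invented, since they were elided in the question):

```python
import pandas as pd

# Hypothetical: pretend sectionC had been a list of dicts per activity
df = pd.DataFrame({
    "activityId": ["1f13d848"],
    "sectionC": [[{"k": 1}, {"k": 2}]],
})
df = df.explode("sectionC")
df = pd.concat(
    [
        df.drop(columns="sectionC"),
        pd.json_normalize(df["sectionC"]).set_index(df.index),
    ],
    axis=1,
)
# one row per sectionC element, with a flat "k" column alongside activityId
```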



You can try flatten_json. I had to remove sections C, D, and E from your example, so it's unclear whether this will work as you need it to. Refer to the documentation for more information.

Set your dictionary equal to test_json:

import pandas as pd
from flatten_json import flatten

# note: test_json needs to be in square brackets or you'll get an error
dic_flattened = (flatten(d, '.') for d in [test_json])
df = pd.DataFrame(dic_flattened)
df

  topLevelIdentifier activities.0.activityId  ... activities.0.sectionB.someCapacityB.totalWeight.value  activities.0.sectionB.someCapacityB.totalWeight.unit
0             a86507                1f13d848  ...                                               0.00000                                                 GRAMS

[1 rows x 22 columns]

# or transpose it to see the columns better
df.T

                                                               0
topLevelIdentifier                                        a86507
activities.0.activityId                                 1f13d848
activities.0.taskList.0.taskIdentifier                 c6ba-1012
activities.0.taskList.0.taskTime.value                         0
activities.0.taskList.0.taskTime.unit                    SECONDS
activities.0.taskList.0.taskLocationIdentifier              3438
activities.0.taskList.1.taskIdentifier                 cc48-07bf
activities.0.taskList.1.taskTime.value                         0
activities.0.taskList.1.taskTime.unit                    SECONDS
activities.0.taskList.1.taskLocationIdentifier              4b32
activities.0.taskList.2.taskIdentifier                    35b896
activities.0.taskList.2.taskTime.value                         0
activities.0.taskList.2.taskTime.unit                    SECONDS
activities.0.taskList.2.taskLocationIdentifier             7bcca
activities.0.sectionB.someCapacityA.totalVolume.value       8415
activities.0.sectionB.someCapacityA.totalVolume.unit       CU_CM
activities.0.sectionB.someCapacityA.totalWeight.value 1059.00000
activities.0.sectionB.someCapacityA.totalWeight.unit       GRAMS
activities.0.sectionB.someCapacityB.totalVolume.value    0.00000
activities.0.sectionB.someCapacityB.totalVolume.unit       CU_CM
activities.0.sectionB.someCapacityB.totalWeight.value    0.00000
activities.0.sectionB.someCapacityB.totalWeight.unit       GRAMS

