
I have a data structure which is a very complex/nested JSON.

After reading other solutions on SO, I am comfortable with the fact that I may need to make multiple DFs and merge them together (e.g. json_normalize with multiple record paths).

{
  "topLevelIdentifier": "a86507",
  "activities": [
    {
      "activityId": "1f13d848",
      "taskList": [
        {
          "taskIdentifier": "c6ba-1012",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "3438"
        },
        {
          "taskIdentifier": "cc48-07bf",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "4b32"
        },
        {
          "taskIdentifier": "35b896",
          "taskTime": {
            "value": 0,
            "unit": "SECONDS"
          },
          "taskLocationIdentifier": "7bcca"
        }
      ],
      "sectionB": {
        "someCapacityA": {
          "totalVolume": {
            "value": 8415,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 1059.0,
            "unit": "GRAMS"
          }
        },
        "someCapacityB": {
          "totalVolume": {
            "value": 0.0,
            "unit": "CU_CM"
          },
          "totalWeight": {
            "value": 0.0,
            "unit": "GRAMS"
          }
        }
      },
 "sectionC": {....},
"sectionD": {....}, 
"sectonE":{...}
}]}

I can create a base for the merges by doing this:

import json
import pandas as pd

with open('my_file.json') as simulation_file:
    simulation_data = json.load(simulation_file)

df_base = pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])

Then normalizing the next part isn't too bad:

df_taskList = pd.json_normalize(simulation_data['activities'],
                                'taskList', ['activityId'])

It's normalizing the next section that has me stuck. Here's what I'm trying, but it throws an error:

df_sectionB = pd.json_normalize(simulation_data['activities'],
                                'activityId',
                                ['taskList', ['taskIdentifier', ['taskTime', ['unit', 'value']], 'taskLocationIdentifier']])

I'm basically trying to flatten out sectionB such that it has the activityId, so that I can merge all the pieces together.
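For context, once df_sectionB exists, the merge I'm aiming for is just a join on activityId. A toy sketch with made-up values shaped like my data (the df_sectionB frame and its column here are hypothetical):

```python
import pandas as pd

# Toy frames mimicking the shapes I expect (values are made up):
df_taskList = pd.DataFrame({
    "activityId": ["1f13d848", "1f13d848"],
    "taskIdentifier": ["c6ba-1012", "cc48-07bf"],
})
df_sectionB = pd.DataFrame({
    "activityId": ["1f13d848"],
    "someCapacityA.totalVolume.value": [8415],
})

# Each task row picks up its activity's sectionB columns
df_merged = df_taskList.merge(df_sectionB, on="activityId", how="left")
```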

If there's a simpler way to flatten out to the deepest layer in a complex format like this, I'm open to alternative approaches!

  • Perhaps you could df_base = pd.json_normalize(simulation_data, 'activities', 'topLevelIdentifier').explode('taskList') - which would allow you to then pd.json_normalize(df_base['taskList']).set_index(df_base.index) Commented Feb 21, 2023 at 0:46

2 Answers


You could add .explode("taskList") to the result of the first normalize.

You can then normalize the taskList column, copy the base index, and concat the results.

df_base = pd.json_normalize(
    simulation_data,
    record_path="activities",
    meta="topLevelIdentifier"
).explode("taskList")

df = pd.concat(
    [
        df_base.drop(columns="taskList"),
        pd.json_normalize(df_base["taskList"]).set_index(df_base.index)
    ],
    axis=1
)
>>> df.T
                                                   0           0           0
activityId                                  1f13d848    1f13d848    1f13d848
sectionC                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionD                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionE                                  {Ellipsis}  {Ellipsis}  {Ellipsis}
sectionB.someCapacityA.totalVolume.value        8415        8415        8415
sectionB.someCapacityA.totalVolume.unit        CU_CM       CU_CM       CU_CM
sectionB.someCapacityA.totalWeight.value      1059.0      1059.0      1059.0
sectionB.someCapacityA.totalWeight.unit        GRAMS       GRAMS       GRAMS
sectionB.someCapacityB.totalVolume.value         0.0         0.0         0.0
sectionB.someCapacityB.totalVolume.unit        CU_CM       CU_CM       CU_CM
sectionB.someCapacityB.totalWeight.value         0.0         0.0         0.0
sectionB.someCapacityB.totalWeight.unit        GRAMS       GRAMS       GRAMS
topLevelIdentifier                            a86507      a86507      a86507
taskIdentifier                             c6ba-1012   cc48-07bf      35b896
taskLocationIdentifier                          3438        4b32       7bcca
taskTime.value                                     0           0           0
taskTime.unit                                SECONDS     SECONDS     SECONDS
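The same explode → normalize → concat pattern repeats if any of the other sections (the elided sectionC/sectionD/sectionE) also turn out to hold lists. A minimal sketch with a hypothetical list-valued column (the sectionC contents here are invented, since they were elided in the question):

```python
import pandas as pd

# Hypothetical: pretend sectionC had been a list of dicts per activity
df = pd.DataFrame({
    "activityId": ["1f13d848"],
    "sectionC": [[{"k": 1}, {"k": 2}]],
})
df = df.explode("sectionC")
df = pd.concat(
    [
        df.drop(columns="sectionC"),
        pd.json_normalize(df["sectionC"]).set_index(df.index),
    ],
    axis=1,
)
# one row per sectionC element, with a flat "k" column alongside activityId
```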



You can try flatten_json. I had to remove sections C, D, and E from your example, so it's unclear whether this will work as you need it to. Refer to the documentation for more information.

Set your dictionary equal to test_json:

import pandas as pd
from flatten_json import flatten

# note: test_json needs to be in square brackets or you'll get an error
dic_flattened = (flatten(d, '.') for d in [test_json])
df = pd.DataFrame(dic_flattened)
df

  topLevelIdentifier activities.0.activityId  ... activities.0.sectionB.someCapacityB.totalWeight.value  activities.0.sectionB.someCapacityB.totalWeight.unit
0             a86507                1f13d848  ...                                               0.00000                                                 GRAMS

[1 rows x 22 columns]

# or transpose it to see the columns better
df.T

                                                               0
topLevelIdentifier                                        a86507
activities.0.activityId                                 1f13d848
activities.0.taskList.0.taskIdentifier                 c6ba-1012
activities.0.taskList.0.taskTime.value                         0
activities.0.taskList.0.taskTime.unit                    SECONDS
activities.0.taskList.0.taskLocationIdentifier              3438
activities.0.taskList.1.taskIdentifier                 cc48-07bf
activities.0.taskList.1.taskTime.value                         0
activities.0.taskList.1.taskTime.unit                    SECONDS
activities.0.taskList.1.taskLocationIdentifier              4b32
activities.0.taskList.2.taskIdentifier                    35b896
activities.0.taskList.2.taskTime.value                         0
activities.0.taskList.2.taskTime.unit                    SECONDS
activities.0.taskList.2.taskLocationIdentifier             7bcca
activities.0.sectionB.someCapacityA.totalVolume.value       8415
activities.0.sectionB.someCapacityA.totalVolume.unit       CU_CM
activities.0.sectionB.someCapacityA.totalWeight.value 1059.00000
activities.0.sectionB.someCapacityA.totalWeight.unit       GRAMS
activities.0.sectionB.someCapacityB.totalVolume.value    0.00000
activities.0.sectionB.someCapacityB.totalVolume.unit       CU_CM
activities.0.sectionB.someCapacityB.totalWeight.value    0.00000
activities.0.sectionB.someCapacityB.totalWeight.unit       GRAMS

