I have a data structure that is a very complex/nested JSON.
After reading other solutions on SO, I am comfortable with the fact that I may need to make multiple DataFrames and merge them together (e.g., json_normalize with multiple record paths). Here is a trimmed sample of the structure:
{
    "topLevelIdentifier": "a86507",
    "activities": [
        {
            "activityId": "1f13d848",
            "taskList": [
                {
                    "taskIdentifier": "c6ba-1012",
                    "taskTime": {
                        "value": 0,
                        "unit": "SECONDS"
                    },
                    "taskLocationIdentifier": "3438"
                },
                {
                    "taskIdentifier": "cc48-07bf",
                    "taskTime": {
                        "value": 0,
                        "unit": "SECONDS"
                    },
                    "taskLocationIdentifier": "4b32"
                },
                {
                    "taskIdentifier": "35b896",
                    "taskTime": {
                        "value": 0,
                        "unit": "SECONDS"
                    },
                    "taskLocationIdentifier": "7bcca"
                }
            ],
            "sectionB": {
                "someCapacityA": {
                    "totalVolume": {
                        "value": 8415,
                        "unit": "CU_CM"
                    },
                    "totalWeight": {
                        "value": 1059.0,
                        "unit": "GRAMS"
                    }
                },
                "someCapacityB": {
                    "totalVolume": {
                        "value": 0.0,
                        "unit": "CU_CM"
                    },
                    "totalWeight": {
                        "value": 0.0,
                        "unit": "GRAMS"
                    }
                }
            },
            "sectionC": {....},
            "sectionD": {....},
            "sectionE": {...}
        }
    ]
}
I can create a base for the merges by doing this:
import json
import pandas as pd

with open('my_file.json') as simulation_file:
    simulation_data = json.load(simulation_file)

df_base = pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])
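If I'm reading the json_normalize defaults correctly, this gives one row per activity, carries topLevelIdentifier along, flattens nested dicts such as sectionB into dotted columns, and leaves taskList as a column of lists of dicts.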
Then normalizing the next part isn't too bad:
df_taskList = pd.json_normalize(simulation_data['activities'],
                                'taskList', ['activityId'])
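Just to show where I'm headed: I expect to stitch the pieces back together roughly like the sketch below, with the column names taken from the sample above.

# sketch: attach the per-activity columns from df_base to each task row via activityId
df_merged = df_taskList.merge(df_base.drop(columns='taskList'), on='activityId', how='left')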
It's normalizing the next section that has me stuck. Here's what I'm trying, but it throws an error:
df_sectionB = pd.json_normalize(simulation_data['activities'],
                                'activityId', ['taskList', ['taskIdentifier', ['taskTime', ['unit', 'value']], 'taskLocationIdentifier']])
I'm basically trying to flatten out sectionB such that it has the activityId on each row, so that I can merge all the pieces together.
If there's a simpler way to flatten out to the deepest layer in a complex format like this, I'm open to alternative approaches!
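For reference, the closest I can think of for sectionB is something like the sketch below (assuming sectionB is a plain nested dict on each activity, so the default flattening turns it into dotted columns), but I'm not sure it's the right pattern to repeat for sectionC, sectionD, etc.:

# no record_path: each activity becomes one row, nested dicts are flattened into dotted columns
df_sectionB = pd.json_normalize(simulation_data['activities'])
# keep only the activityId key plus the flattened sectionB columns
section_b_cols = [c for c in df_sectionB.columns if c.startswith('sectionB.')]
df_sectionB = df_sectionB[['activityId'] + section_b_cols]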
df_base = pd.json_normalize(simulation_data, 'activities', 'topLevelIdentifier').explode('taskList'), which would allow you to then do pd.json_normalize(df_base['taskList']).set_index(df_base.index).
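A fuller sketch of that approach, reusing the simulation_data loaded above (this assumes the sample structure shown is representative; the reset_index just keeps the row alignment unambiguous before gluing the flattened tasks back on):

# one row per activity, sectionB/sectionC/... flattened to dotted columns,
# then exploded so each task dict in taskList gets its own row
df_base = (pd.json_normalize(simulation_data, 'activities', ['topLevelIdentifier'])
           .explode('taskList')
           .reset_index(drop=True))

# flatten the per-row task dicts and line them up with df_base by position
df_tasks = pd.json_normalize(df_base['taskList']).set_index(df_base.index)
df_flat = pd.concat([df_base.drop(columns='taskList'), df_tasks], axis=1)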