5

I have now added the current problem onto GitHib. Please find the URL for the repo. I have included a Jupyter notebook that also explains the problem. Thanks guys.

https://github.com/simongraham/dataExplore.git


I am currently working with nutritional data for a project, where the data is in raw JSON format, and I want to use python and pandas to obtain an understandable data frame. I understand that this is an easy task when the JSON is not nested. Here I would use:

nutrition = pd.read_json('data')

However I have nested information and I am finding it very difficult to get it into a reasonable data frame. The JSON format is as follows, where the nutritionNutrients element itself is a nested element. The nest for this element will describe the nutritional contents for a variety of different things such as alcohol and bcfa, as included. I have only included a sample because this is a large data file.

  [
        {
            "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
            "vcNutritionId": "2476378b-79ee-4857-a81d-489661a039a1",
            "vcUserId": "cc51145b-5a70-4344-9b55-1a4455f0a9d2",
            "vcPortionId": "1",
            "vcPortionName": "1 average pepper",
            "vcPortionSize": "20",
            "ftEnergyKcal": 5.2,
            "vcPortionUnit": "g",
            "dtConsumedDate": "2016-05-04T00:00:00",
            "nutritionNutrients": [
                {
                    "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                    "vcNutrient": "alcohol",
                    "ftValue": 0,
                    "vcUnit": "g",
                    "nPercentRI": 0,
                    "vcTrafficLight": ""
                },
                {
                    "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                    "vcNutrient": "bcfa",
                    "ftValue": 0,
                    "vcUnit": "g",
                    "nPercentRI": 0,
                    "vcTrafficLight": ""
                },
                {
                    "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                    "vcNutrient": "biotin",
                    "ftValue": 0,
                    "vcUnit": "µg",
                    "nPercentRI": 0,
                    "vcTrafficLight": ""
                },
                ...
            ]
        }
    ]

Any help would be appreciated.

Thanks.

.... ....

Now that I have found out how to solve this problem using json_normalize, I return the same problem, but this time my code is nested twice. Ie:

[
{
...
}
[,
"nutritionPortions": [
    {
        "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
        "vcNutritionId": "2476378b-79ee-4857-a81d-489661a039a1",
        "vcUserId": "cc51145b-5a70-4344-9b55-1a4455f0a9d2",
        "vcPortionId": "1",
        "vcPortionName": "1 average pepper",
        "vcPortionSize": "20",
        "ftEnergyKcal": 5.2,
        "vcPortionUnit": "g",
        "dtConsumedDate": "2016-05-04T00:00:00",
        "nutritionNutrients": [
            {
                "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                "vcNutrient": "alcohol",
                "ftValue": 0,
                "vcUnit": "g",
                "nPercentRI": 0,
                "vcTrafficLight": ""
            },
            {
                "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                "vcNutrient": "bcfa",
                "ftValue": 0,
                "vcUnit": "g",
                "nPercentRI": 0,
                "vcTrafficLight": ""
            },
            {
                "vcNutritionPortionId": "478d1905-f264-4d9b-ab76-0ed4252193fd",
                "vcNutrient": "biotin",
                "ftValue": 0,
                "vcUnit": "µg",
                "nPercentRI": 0,
                "vcTrafficLight": ""
            },
            ...
           }
          ]
        }
      ]

When I have a JSON consisting of only nutrition data I can use:

nutrition = (pd.io
   .json
   .json_normalize((data, ['nutritionPortions']), 'nutritionNutrients',
        ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
         'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
)

However, my data does not only contain nutrition information. For example it will contain activity information, and therefore the nutrition information is nested with "nutrtitionPortions" at the start. Let's assume that all other columns are not nested and they are represented by "Activity" and "Wellbeing".

If I use the code:

nutrition = (pd.io
   .json
   .json_normalize(data, ['nutritionPortions'])
)

I will return to the original problem where "nutritionNutrients" is nested, but I am having no success then obtaining the corresponding data frame.

Thanks

1 Answer 1

4

UPDATE: this should work for your kaidoData.json file:

df = (pd.io
        .json
        .json_normalize(data[0]['ionPortions'], 'nutritionNutrients',
            ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
             'dtCreatedDate','dtUpdatedDate','nProcessingStatus',
             'vcPortionUnit','dtConsumedDate'
            ]
        )
)

PS i don't know what is wrong with the 'ftEnergyKcal' - it throws me:

KeyError: 'ftEnergyKcal'

maybe it's missing in some sections

OLD answer:

use json_normalize():

(pd.io
   .json
   .json_normalize(l, 'nutritionNutrients',
        ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
         'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
)

demo:

In [107]: (pd.io
   .....:    .json
   .....:    .json_normalize(l, 'nutritionNutrients',
   .....:         ['vcNutritionId','vcUserId','vcPortionId','vcPortionName','vcPortionSize',
   .....:          'ftEnergyKcal','vcPortionUnit','dtConsumedDate'])
   .....: )
Out[107]:
   ftValue  nPercentRI vcNutrient vcNutritionPortionId vcTrafficLight        ...        vcPortionSize  \
0        0           0    alcohol  478d1905-f264-4d...                       ...                   20
1        0           0       bcfa  478d1905-f264-4d...                       ...                   20
2        0           0     biotin  478d1905-f264-4d...                       ...                   20

         vcNutritionId vcPortionId ftEnergyKcal     vcPortionName
0  2476378b-79ee-48...           1          5.2  1 average pepper
1  2476378b-79ee-48...           1          5.2  1 average pepper
2  2476378b-79ee-48...           1          5.2  1 average pepper

[3 rows x 14 columns]

where l is your list (parsed JSON)

Sign up to request clarification or add additional context in comments.

5 Comments

I don't suppose you can give me a hand with my latest edit of the current problem. Thanks :)
@simongraham, i think you would have to pass something like data['nutritionPortions'] instead of data to the json_normalize
I'm afraid I already tried this, but with no luck unfortunately @MaxU. I guess I need a way of gaining the subset of the raw JSON data
@simongraham, i would recommend you to open another question and put there a working sample of your data. For example: data = [...], which can be executed in Python without errors. Alternatively you can upload your data as file somewhere and post a link to this file in your question...
I have added the relevant data to github.com/simongraham/dataExplore.git . I have also added a Jupiter notebook. Feel free to play around :) @MaxU

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.