1

I am trying to create a dataframe from nested json file but running into trouble.

[
    {
        "rec_id": "1",
        "user": {
            "id": "12414",
            "name": "Steve"
        },
        "addresses": [
            {
                "address_type": "Home",
                "street1": "100 Main St",
                "street2": null,
                "city": "Chicago"
            },
            {
                "address_type": "Work",
                "street1": "100 Main St",
                "street2": null,
                "city": "Chicago"
            }
        ],
        "timestamp": "2023-07-28T20:05:14.859000+00:00",
        "edited_timestamp": null
    },
    {
        "rec_id": "2",
        "user": {
            "id": "214521",
            "name": "Tim"
        },
        "addresses": [
            {
                "address_type": "Home",
                "street1": "100 Main St",
                "street2": null,
                "city": "Boston"
            },
            {
                "address_type": "Work",
                "street1": "100 Main St",
                "street2": null,
                "city": "Boston"
            }
        ],
        "timestamp": "2023-07-28T20:05:14.859000+00:00",
        "edited_timestamp": null
    },
    {
        "rec_id": "3",
        "user": {
            "id": "12121",
            "name": "Jack"
        },
        "addresses": [
            {
                "address_type": "Home",
                "street1": "100 Main St",
                "street2": null,
                "city": "Las Vegas"
            } ]
        "timestamp": "2023-07-28T20:05:14.859000+00:00",
        "edited_timestamp": null
    }
]

I tried below:

with open("employee.json") as file:
    data = json.load(file)  

data_df = pd.json_normalize(data)

data_df.columns.values.tolist()

['rec_id',
 'addresses',
 'timestamp',
 'edited_timestamp',
 'user.id',
 'user.name']

display(data_df)
    rec_id  addresses   timestamp   edited_timestamp    user.id user.name
0   1   [{'address_type': 'Home', 'street1': '100 Main St', 'street2': None, 'city': 'Chicago'}, {'address_type': 'Work', 'street1': '100 Main St', 'street2': None, 'city': 'Chicago'}]    2023-07-28T20:05:14.859000+00:00    None    12414   Steve
1   2   [{'address_type': 'Home', 'street1': '100 Main St', 'street2': None, 'city': 'Boston'}, {'address_type': 'Work', 'street1': '100 Main St', 'street2': None, 'city': 'Boston'}]  2023-07-28T20:05:14.859000+00:00    None    214521  Tim
2   3   [{'address_type': 'Home', 'street1': '100 Main St', 'street2': None, 'city': 'Las Vegas'}]  2023-07-28T20:05:14.859000+00:00    None    12121   Jack

How do I get the output as below -

rec_id  addresses.address_type  addresses.street1   addresses.street2   addresses.city  timestamp   edited_timestamp    user.id user.name
0   1   Home    100 Main St None    Chicago 2023-07-28T20:05:14.859000+00:00    None    12414   Steve
1   1   Work    100 Main St None    Chicago 2023-07-28T20:05:14.859000+00:00    None    12414   Steve
2   2   Home    100 Main St None    Boston  2023-07-28T20:05:14.859000+00:00    None    214521  Tim
3   2   Work    100 Main St None    Boston  2023-07-28T20:05:14.859000+00:00    None    214521  Tim
4   3   Home    100 Main St None    Las Vegas   2023-07-28T20:05:14.859000+00:00    None    12121   Jack

2 Answers 2

1

Try this, more about .json_normalize

df = pd.json_normalize(data, 
                       record_path='addresses', 
                       meta=['rec_id', ["user", "id"], ["user", "name"], 'timestamp', 'edited_timestamp'])

Output:

address_type street1 street2 city rec_id user.id user.name timestamp edited_timestamp
0 Home 100 Main St Chicago 1 12414 Steve 2023-07-28T20:05:14.859000+00:00
1 Work 100 Main St Chicago 1 12414 Steve 2023-07-28T20:05:14.859000+00:00
2 Home 100 Main St Boston 2 214521 Tim 2023-07-28T20:05:14.859000+00:00
3 Work 100 Main St Boston 2 214521 Tim 2023-07-28T20:05:14.859000+00:00
4 Home 100 Main St Las Vegas 3 12121 Jack 2023-07-28T20:05:14.859000+00:00
Sign up to request clarification or add additional context in comments.

Comments

0

Try:

data_df = pd.json_normalize(
    data,
    meta=["rec_id", "timestamp", "edited_timestamp", ["user", "id"], ["user", "name"]],
    record_path=["addresses"],
    record_prefix="addresses.",
)
print(data_df)

Prints:

  addresses.address_type addresses.street1 addresses.street2 addresses.city rec_id                         timestamp edited_timestamp user.id user.name
0                   Home       100 Main St              None        Chicago      1  2023-07-28T20:05:14.859000+00:00             None   12414     Steve
1                   Work       100 Main St              None        Chicago      1  2023-07-28T20:05:14.859000+00:00             None   12414     Steve
2                   Home       100 Main St              None         Boston      2  2023-07-28T20:05:14.859000+00:00             None  214521       Tim
3                   Work       100 Main St              None         Boston      2  2023-07-28T20:05:14.859000+00:00             None  214521       Tim
4                   Home       100 Main St              None      Las Vegas      3  2023-07-28T20:05:14.859000+00:00             None   12121      Jack

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.