Using json_normalize with pandas

Question

I'm using the following code in python to flatten the json structure below, however it doesn't work for all levels. I'm interested in the tags.tags column data specifically shown on the picture below

pd.json_normalize(data['conversations'])
pd.set_option('display.max_columns', None)

Json structure:

[{'type': 'conversation.list',
  'pages': {'type': 'pages',
   'next': {'page': 3,
    'starting_after': 'WzE3MTU3ODIxOTc=='},
   'page': 2,
   'per_page': 1,
   'total_pages': 46969},
  'total_count': 46969,
  'conversations': [{'type': 'conversation',
    'id': '1384780',
    'created_at': 1715780970,
    'updated_at': 1715782197,
    'waiting_since': None,
    'snoozed_until': None,
    'source': {'type': 'conversation',
     'id': '2197597651',
     'delivered_as': 'customer_initiated',
     'subject': '',
     'body': '<p>Outros</p>',
     'author': {'type': 'user',
      'id': '64ac5cacccd2b047',
      'name': 'Claudiney Pinho',
      'email': '[email protected]'},
     'attachments': [],
     'url': None,
     'redacted': False},
    'contacts': {'type': 'contact.list',
     'contacts': [{'type': 'contact',
       'id': '64accccd71982047',
       'external_id': 'b363-cc8f--fb270b5e72e8'}]},
    'first_contact_reply': {'created_at': 1715780970,
     'type': 'conversation',
     'url': None},
    'admin_assignee_id': 5614527,
    'team_assignee_id': 5045796,
    'open': False,
    'state': 'closed',
    'read': True,
    'tags': {'type': 'tag.list',
     'tags': [{'type': 'tag',
       'id': '5379642',
       'name': '[BOT] Other',
       'applied_at': 1715781024,
       'applied_by': {'type': 'admin', 'id': '4685750'}},
      {'type': 'tag',
       'id': '5379660',
       'name': '[BOT] Connected Agent',
       'applied_at': 1715781025,
       'applied_by': {'type': 'admin', 'id': '4685750'}},
      {'type': 'tag',
       'id': '5379654',
       'name': '[BOT] Not Resolved',
       'applied_at': 1715781027,
       'applied_by': {'type': 'admin', 'id': '4685750'}},
      {'type': 'tag',
       'id': '7046337',
       'name': '[BOT] Portuguese',
       'applied_at': 1715781029,
       'applied_by': {'type': 'admin', 'id': '4685750'}}]},
    'priority': 'not_priority',
    'sla_applied': None,
    'statistics': {'type': 'conversation_statistics',
     'time_to_assignment': 0,
     'time_to_admin_reply': 189,
     'time_to_first_close': 1158,
     'time_to_last_close': 1228,
     'median_time_to_reply': 139,
     'first_contact_reply_at': 1715780970,
     'first_assignment_at': 1715780970,
     'first_admin_reply_at': 1715781159,
     'first_close_at': 1715782128,
     'last_assignment_at': 1715781159,
     'last_assignment_admin_reply_at': 1715781159,
     'last_contact_reply_at': 1715782179,
     'last_admin_reply_at': 1715782125,
     'last_close_at': 1715782198,
     'last_closed_by_id': 5614527,
     'count_reopens': 1,
     'count_assignments': 3,
     'count_conversation_parts': 28},
    'conversation_rating': None,
    'teammates': {'type': 'admin.list',
     'admins': [{'type': 'admin', 'id': '5614527'}]},
    'title': None,
    'custom_attributes': {'Language': 'Portuguese',
     'Conversation status': 'Open',
     'From': 'iOS / Android'},
    'topics': {'type': 'topic.list', 'topics': [], 'total_count': 0},
    'ticket': None,
    'linked_objects': {'type': 'list',
     'data': [],
     'total_count': 0,
     'has_more': False}}]}]

how the dataset is displayed

I wish to create a pandas data frame with flatten key:values in a table

Your JSON data is invalid, has several unquoted strings like None, True, False. Check validity first, maybe it's just that. — Pepe N O
– Pepe N O, Commented Sep 13, 2024 at 15:27
@PepeNO This is JSON-compatible Python data, not JSON itself. — wjandrea
– wjandrea, Commented Sep 13, 2024 at 16:41
Welcome to Stack Overflow! Please take the tour. SO is a site for specific questions, but there's no question here and the problem is a little vague. Please read How to Ask and make a minimal reproducible example including desired output dataframe. See also How to make good reproducible pandas examples. If you could reduce the size of the input data, that'd be great. — wjandrea
– wjandrea, Commented Sep 13, 2024 at 16:47
Also, please don't post pictures of text. Instead, copy the text itself, edit it into your post, and use the formatting tools like code formatting (print(df)) or table formatting (print(df.to_markdown()). — wjandrea
– wjandrea, Commented Sep 13, 2024 at 16:48

D.lola · Accepted Answer · 2024-09-13 15:33:50Z

You can try using pd.json_normalize function with the record_path and meta parameters. This should flatten the JSON to different multiple levels of your choice, you can then further extract each nested lists or dictionaries into columns, which is of interest to you.

I have provided a code snippet for you, I am not using jupyter so you might need to test it but in theory this should work nicely, since the concept is true.

import pandas as pd

df = pd.json_normalize(
    data[0]["conversations"],
    record_path=["tags", "tags"],
    meta=[
        "id",
        "created_at",
        "updated_at",
        ["source", "id"],
        ["source", "author", "name"],
    ],
    meta_prefix="meta_",  # trying to avoid conflicts with available ids 
    errors="ignore",
)
pd.set_option("display.max_columns", None)

Also note you have many NaN which caused errors, also I had to use meta_prefix for id.

  applied_by.id  meta_id meta_created_at meta_updated_at meta_source.id  \
0       4685750  1384780      1715780970      1715782197            NaN   
1       4685750  1384780      1715780970      1715782197            NaN   
2       4685750  1384780      1715780970      1715782197            NaN   
3       4685750  1384780      1715780970      1715782197            NaN   

  meta_source.author.name  
0                     NaN  
1                     NaN  
2                     NaN  
3                     NaN

Simon · Accepted Answer · 2024-09-18 16:28:45Z

0

D.lola, it works however I tried to loop over the indices and append the results to other variable. I came up with the error: "list indices must be integers or slices, not dict"

It works if I run 'all_data[0]["conversations"]' without 'i += 1'

    data_1 = []  # This will store the normalized data
    i=0
    
    for i in all_data:  # Iterating directly over items in all_data
        normalized_data = pd.json_normalize(
            all_data[i]["conversations"],
            record_path=["tags", "tags"],
            meta=[
                "id",
                "created_at",
                "updated_at",
                ["source", "id"],
                ["source", "author", "name"],
            ],
            meta_prefix="meta_",  # Avoiding conflicts with available ids 
            errors="ignore",
        )
        
        pd.set_option("display.max_columns", None)
        
        # Append the normalized data to date_1
        data_1.append(normalized_data)
        
        i += 1
    
    # If you want to combine all DataFrames into one:
    if data_1:  # Check if date_1 is not empty
        final_data = pd.concat(data_1, ignore_index=True)

edited Sep 18, 2024 at 16:28

Simon

1,2096 silver badges19 bronze badges

answered Sep 16, 2024 at 21:22

lpca

153 bronze badges

2 Comments

D.lola Over a year ago

Hi Lucas, i becomes each element in all_data so it's a dict, not an index. does i contain each item you want in all_data? I am guessing you want to normalise i as well. You can try replacing all_data[i] with just i let's see what happens and also no need for this i += 1 at the end. I hope it helps

lpca Over a year ago

It worked. Thank you

Collectives™ on Stack Overflow

Using json_normalize with pandas

2 Answers 2

Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related