1

I'm trying to unnest congressional data here: https://theunitedstates.io/congress-legislators/legislators-historical.json

Sample structure:

    {
    "id": {
      "bioguide": "B000226",
      "govtrack": 401222,
      "icpsr": 507,
      "wikipedia": "Richard Bassett (politician)",
      "wikidata": "Q518823",
      "google_entity_id": "kg:/m/02pz46"
    },
    "name": {
      "first": "Richard",
      "last": "Bassett"
    },
    "bio": {
      "birthday": "1745-04-02",
      "gender": "M"
    },
    "terms": [
      {
        "type": "sen",
        "start": "1789-03-04",
        "end": "1793-03-03",
        "state": "DE",
        "class": 2,
        "party": "Anti-Administration"
      }
    ]
  }

If I just use json_normalize(data), the "terms" don't unnest.

If I try to unnest the terms specifically, like json_normalize(data, 'terms', 'name'), then whatever else I include (here the names) stays in dict format with {u'last': u'Bassett', u'first': u'Richard'} as the row entry.

Full current code, if you want to run it:

import json
import urllib
import pandas as pd
from pandas.io.json import json_normalize

# load data
url = "https://theunitedstates.io/congress-legislators/legislators-historical.json"
json_url = urllib.urlopen(url)
data = json.loads(json_url.read())

# parse
congress_names = json_normalize(data, record_path='terms',meta='name')
2
  • Can you explain/show what kind of output you're trying to get? Commented Jan 9, 2020 at 23:10
  • looking at the json, i think you need to make several seperate dfs, it will be very long, for each one it has a unique schema which you need to infer. Commented Jan 10, 2020 at 0:26

2 Answers 2

1

I think following code should work. There may be a better way to normalize, but I am unaware.

import requests
import pandas as pd
import re
import json
from pandas.io.json import json_normalize

url = ' https://theunitedstates.io/congress-legislators/legislators-historical.json'
resp = requests.get(url)
raw_dict = json.loads(resp.text)

df = pd.DataFrame()
for i in range(len(raw_dict)):    
     df1 = json_normalize(raw_dict[i], record_path = ['terms'], meta = ['name'])
     df1 = pd.concat([df1, df1['name'].apply(pd.Series)], axis=1)
     df = pd.concat([df,df1], axis=0, ignore_index =True, sort=True)
Sign up to request clarification or add additional context in comments.

Comments

1

When you specify terms as rec_path, you need to specify a list of path for the rest of columns as meta. Use list comprehension to build such list as follows

from pandas.io import json

l_meta = [[k, k1]  for k in data[0] if k != 'terms' for k1 in data[0][k]]
congress_names = json.json_normalize(data, 'terms', l_meta, errors='ignore')

Out[1105]:
  type       start         end state  class                party  district  \
0  sen  1789-03-04  1793-03-03    DE    2.0  Anti-Administration       NaN
1  rep  1789-03-04  1791-03-03    VA    NaN                  NaN       9.0

  id.bioguide id.govtrack id.icpsr                    id.wikipedia  \
0     B000226      401222      507    Richard Bassett (politician)
1     B000546      401521      786  Theodorick Bland (congressman)

  id.wikidata id.google_entity_id  name.first name.last bio.birthday  \
0     Q518823        kg:/m/02pz46     Richard   Bassett   1745-04-02
1    Q1749152        kg:/m/033mf4  Theodorick     Bland   1742-03-21

  bio.gender
0          M
1          M

Note: I pick only first 2 elements/objects from data for this testing purpose. I also assume 1st elements (data[0]) has all columns.


Method 2:

normalize whole data as main congress_names. After that slice only column terms and normalize it to a new df1 and join back

congress_names = json.json_normalize(data)
df1 = json.json_normalize(congress_names.terms.str[0])
congress_names = congress_names.join(df1).drop('terms', axis=1)

Out[1130]:
  id.bioguide  id.govtrack  id.icpsr                    id.wikipedia  \
0     B000226       401222       507    Richard Bassett (politician)
1     B000546       401521       786  Theodorick Bland (congressman)

  id.wikidata id.google_entity_id  name.first name.last bio.birthday  \
0     Q518823        kg:/m/02pz46     Richard   Bassett   1745-04-02
1    Q1749152        kg:/m/033mf4  Theodorick     Bland   1742-03-21

  bio.gender  id.house_history type       start         end state  class  \
0          M               NaN  sen  1789-03-04  1793-03-03    DE    2.0
1          M            9479.0  rep  1789-03-04  1791-03-03    VA    NaN

                 party  district
0  Anti-Administration       NaN
1                  NaN       9.0

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.