I am parsing nested JSON data from here. Some of the files in this data set have more than one committee_id associated with them, and I need every committee associated with each file. I'm not sure, but I imagine that means writing a new row for each committee_id. My code follows:

import os.path
import csv
import json

path = '/home/jayaramdas/anaconda3/Thesis/govtrack/bills109/hr'
dirs = os.listdir(path)
outputfile = open('df/h109_s_b', 'w', newline='')                            
outputwriter = csv.writer(outputfile)

for dir in dirs:
    with open(path + "/" + dir + "/data.json", "r") as f:
        data = json.load(f)

        a = data['introduced_at']
        b = data['bill_id']
        c = data['sponsor']['thomas_id']
        d = data['sponsor']['state']
        e = data['sponsor']['name']
        f = data['sponsor']['type']
        i = data['subjects_top_term']   
        j = data['official_title']               

        if data['committees']:
            g = data['committees'][0]['committee_id']
        else:
            g = "None"                      
    outputwriter.writerow([a, b, c, d, e, f, g, i, j])
outputfile.close()       

The problem I am having is that my code is only collecting the first committee_id listed. For example, file hr145 looks like this:

 "committees": [
{
  "activity": [
    "referral", 
    "in committee"
  ], 
  "committee": "House Transportation and Infrastructure", 
  "committee_id": "HSPW"
}, 
{
  "activity": [
    "referral"
  ], 
  "committee": "House Transportation and Infrastructure", 
  "committee_id": "HSPW", 
  "subcommittee": "Subcommittee on Economic Development, Public Buildings and Emergency Management", 
  "subcommittee_id": "13"
}, 
{
  "activity": [
    "referral", 
    "in committee"
  ], 
  "committee": "House Financial Services", 
  "committee_id": "HSBA"
}, 
{
  "activity": [
    "referral"
  ], 
  "committee": "House Financial Services",
  "committee_id": "HSBA",
  "subcommittee": "Subcommittee on Domestic and International Monetary Policy, Trade, and Technology", 
  "subcommittee_id": "19"
}

This is where it is a little bit tricky because I also want the subcommittee_id associated with the committee_id when the bill gets passed to a subcommittee:

bill_id     committee   subcommittee   introduced_at   thomas_id   state   name
hr145-109   HSPW        na             "2005-01-4"     73          NY      "McHugh, John M."
hr145-109   HSPW        13             "2005-01-4"     73          NY      "McHugh, John M."
hr145-109   HSBA        na             "2005-01-4"     73          NY      "McHugh, John M."
hr145-109   HSBA        19             "2005-01-4"     73          NY      "McHugh, John M."

Any ideas?

1 Answer

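You can do it this way with pandas' json_normalize, which returns one row per element of the committees list. Here fn is the full or relative path to a single data.json file, and ujson has to be imported first (the standard json module works just as well). In pandas 1.0 and later the same function is exposed as pd.json_normalize: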

In [111]: with open(fn) as f:
   .....:     data = ujson.load(f)
   .....:

In [112]: committees = pd.io.json.json_normalize(data, 'committees')

In [113]: committees
Out[113]:
             activity                                committee committee_id                            subcommittee subcommittee_id
0          [referral]                House Energy and Commerce         HSIF                                     NaN             NaN
1          [referral]                House Energy and Commerce         HSIF  Subcommittee on Energy and Air Quality              03
2          [referral]        House Education and the Workforce         HSED                                     NaN             NaN
3          [referral]                 House Financial Services         HSBA                                     NaN             NaN
4          [referral]                        House Agriculture         HSAG                                     NaN             NaN
5  [referral, markup]                          House Resources         HSII                                     NaN             NaN
6          [referral]                            House Science         HSSY                                     NaN             NaN
7          [referral]                     House Ways and Means         HSWM                                     NaN             NaN
8          [referral]  House Transportation and Infrastructure         HSPW                                     NaN             NaN
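
To carry the bill-level fields along with each committee row, json_normalize also accepts a meta argument. The following is a minimal sketch, assuming pandas 1.0 or later (pd.json_normalize); the sample dict is a trimmed, illustrative stand-in for one data.json file, built from the hr145 values shown in the question, and the choice of meta columns is an assumption:

```python
import pandas as pd

# Illustrative stand-in for one parsed data.json file (values taken from
# the question's hr145 example; a real file has many more keys).
bill = {
    "bill_id": "hr145-109",
    "introduced_at": "2005-01-4",
    "sponsor": {"thomas_id": "73", "state": "NY", "name": "McHugh, John M."},
    "committees": [
        {"activity": ["referral", "in committee"],
         "committee": "House Transportation and Infrastructure",
         "committee_id": "HSPW"},
        {"activity": ["referral"],
         "committee": "House Transportation and Infrastructure",
         "committee_id": "HSPW",
         "subcommittee_id": "13"},
        {"activity": ["referral", "in committee"],
         "committee": "House Financial Services",
         "committee_id": "HSBA"},
        {"activity": ["referral"],
         "committee": "House Financial Services",
         "committee_id": "HSBA",
         "subcommittee_id": "19"},
    ],
}

# One row per committee entry; the meta fields are repeated on every row.
# Nested meta paths become dotted column names such as "sponsor.state".
rows = pd.json_normalize(
    bill,
    record_path="committees",
    meta=[
        "bill_id",
        "introduced_at",
        ["sponsor", "thomas_id"],
        ["sponsor", "state"],
        ["sponsor", "name"],
    ],
)

# Entries without a subcommittee come back as NaN; use "na" as in the question.
rows["subcommittee_id"] = rows["subcommittee_id"].fillna("na")

print(rows[["bill_id", "committee_id", "subcommittee_id", "sponsor.state"]])
```

Note that a bill whose committees list is empty produces no rows at all with this approach, so such bills would need separate handling if they must appear in the output.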

UPDATE:

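If you want to have all your data in one DF, you can do it this way:

import os
import ujson
import pandas as pd

start_path = '/home/jayaramdas/anaconda3/Thesis/govtrack/bills109/hr'

def get_merged_json(start_path):
    # Walk the directory tree and load every *.json file into one list
    merged = []
    for p, _, files in os.walk(start_path):
        for f in files:
            if f.endswith('.json'):
                # Join with p, the directory currently being walked
                with open(os.path.join(p, f)) as fh:
                    merged.append(ujson.load(fh))
    return merged

data = get_merged_json(start_path)
df = pd.read_json(ujson.dumps(data))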

PS: this puts all of a bill's committees into a single column as JSON data, though.
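
For the one-row-per-committee CSV layout asked for in the question, a standard-library-only version is also possible. This is a rough sketch rather than a tested solution for the real data set: committee_rows and write_committee_csv are hypothetical helper names, the column order follows the desired output in the question, and "na" is used wherever a committee or subcommittee ID is missing.

```python
import csv
import json
import os


def committee_rows(data):
    """Return one row per entry in data['committees'] (hypothetical helper)."""
    sponsor = data.get("sponsor") or {}
    bill_id = data.get("bill_id")
    tail = [
        data.get("introduced_at"),
        sponsor.get("thomas_id"),
        sponsor.get("state"),
        sponsor.get("name"),
    ]
    committees = data.get("committees") or []
    if not committees:
        # Keep bills with no committee referral as a single placeholder row
        return [[bill_id, "na", "na"] + tail]
    return [
        [bill_id, c.get("committee_id", "na"), c.get("subcommittee_id", "na")] + tail
        for c in committees
    ]


def write_committee_csv(bills_dir, out_path):
    """Read bills_dir/<bill>/data.json files; write one CSV row per committee entry."""
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["bill_id", "committee_id", "subcommittee_id",
                         "introduced_at", "thomas_id", "state", "name"])
        for name in sorted(os.listdir(bills_dir)):
            json_path = os.path.join(bills_dir, name, "data.json")
            if not os.path.isfile(json_path):
                continue  # skip anything that is not a bill directory
            with open(json_path) as fh:
                writer.writerows(committee_rows(json.load(fh)))
```

It would be called as write_committee_csv(path, 'df/h109_s_b'), with path being the bills directory from the question.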


6 Comments

Thanks again MaxU! I have a small question: what should fn be pointing to? Wait, I think I got it. fn = filename.
@MichaelPerdue, yes, it should be the full or relative path to your file, including its name
I have applied your code with one exception: I substituted json for ujson, as I was getting a NameError: name 'ujson' is not defined. However, it is only returning one row. As fn I am using (path + "/" + dir + "/data.json", "r"). I can probably tool around with it to get it working, but would you have an idea of what that is?
@MichaelPerdue, the number of rows will vary depending on the number of elements in the committees list in each file
@MichaelPerdue, I've updated my answer, please check. I would also open a new question about how to expand a JSON column into multiple columns, because it might be tricky