How to extract values from nested JSON array using pandas

Question

I have a large JSON file (400k lines). I am trying to isolate the following:

Policies- "description"

policy items - "users" and "database values"

JSON FILE - https://pastebin.com/hv8mLfgx

Expected Output from Pandas: https://i.sstatic.net/enkx6.jpg

Everything after "Policy Items" is repeated in exactly the same form throughout the file. I tried the code below to isolate "users", but it doesn't work. My goal is to dump all of this into a CSV.

Edit: here is a solution I tried, but I could not get it to work correctly: Deeply nested JSON response to pandas dataframe

from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    for item in jsonDF['policies'][0]['policyItems'][0]:
        print ('{} - {} - {}'.format(jsonDF['users']))

EDIT 2: I have some working code which is able to grab some of the USERS, but it does not grab all of them. Only 11 out of 25.

from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    pNode = Jnormal(jsonDF['policies'][0]['policyItems'], record_path='users')
    print(pNode.head(500))

EDIT 3: This is the final working copy, but I am still not copying over all my TABLE data. I set up the loop so that it filters nothing: the idea is to capture everything and sort it in Excel. Does anyone have any idea why I cannot capture all the TABLE values?

with open("Ranger_Policies_20190204_195010.json") as file:
    json_data = json.load(file)
    with open("test.csv", 'w', newline='') as fd:
        wr = csv.writer(fd)
        wr.writerow(('Database name', 'Users', 'Description', 'Table'))
        for policy in json_data['policies']:
            desc = policy['description']
            db_values = policy['resources']['database']['values']
            db_tables = policy['resources']['table']['values']
            for item in policy['policyItems']:
                users = item['users']
                for dbT in db_tables:
                    for user in users:
                        for db in db_values:
                            _ = wr.writerow((db, user, desc, dbT))

I've thought about building some recursion into this with mapping. I personally don't even know where to start since I am a beginner in Python. Any advice or direction would be appreciated.
– Jaz, Commented Feb 12, 2019 at 18:07
so you want something that maps a description to the users? and policies is just a big list and you want to perform that operation on every dictionary within that list?
– gold_cy, Commented Feb 12, 2019 at 18:17
@aws_apprentice Yeah that's exactly it. The Description is actually a "Database" description. My goal is to map description to Database, then list Users under said database. Sorry for the initial confusion
– Jaz, Commented Feb 12, 2019 at 18:19
can you please show a small example of an expected output? thanks
– gold_cy, Commented Feb 12, 2019 at 18:20
@ChrisLarson I understand that. This is a snippet of the larger JSON file. Didn't realize the typo on the bottom when I cut it out
– Jaz, Commented Feb 12, 2019 at 18:35

Serge Ballesta · Accepted Answer · 2019-02-12 19:12:24Z

2

Pandas is overkill here: the standard csv module is enough. You just have to iterate over policies to extract the description and the database values, then over policyItems to extract the users:

import csv
import json

# Load the whole policy export into memory.
with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)

# Open the output for writing; newline='' lets the csv module control line endings.
with open("outputfile.csv", "w", newline='') as fd:
    wr = csv.writer(fd)
    _ = wr.writerow(('Database name', 'Users', 'Description'))
    for policy in jsonDF['policies']:
        desc = policy['description']
        db_values = policy['resources']['database']['values']
        for item in policy['policyItems']:
            users = item['users']
            for user in users:
                for db in db_values:
                    # Skip the wildcard entry; keep only named databases.
                    if db != '*':
                        _ = wr.writerow((db, user, desc))

answered Feb 12, 2019 at 19:12

Serge Ballesta

4 Comments

Jaz Over a year ago

This works! It needs some modifications, since it's not capturing everything. I'll need to understand the logic better to modify it as needed. Thanks Serge

Jaz Over a year ago

I added a line after the database values: db_tables = policy['resources']['table']['values']. This gives me a KeyError, even though 'table' is inside resources. Not sure why?

Jaz Over a year ago

Found the error: it was poorly formatted JSON. We also have multiple tables inside resources. I'm trying to work that into your loop but running into difficulty.

Jaz Over a year ago

Found an issue with this messy JSON: sometimes users or database values can be [] or " ". I'll need to add logic to handle that. Appreciate you laying the foundation. I'm just trying to work the logic out, and it's proving to be a bit difficult
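The problems raised in these comments (a missing 'table' key, several tables per policy, and users or values that are [] or " ") can be handled with defaults instead of direct indexing. The following is only a sketch under assumptions: the helper names as_list, iter_rows and write_csv are made up for illustration, the sample data is invented, and the real file may have other irregularities that this does not cover.

```python
import csv
import io


def as_list(value):
    """Return value as a list of non-blank strings (tolerates None, '' and ' ')."""
    if value is None:
        return []
    if isinstance(value, str):
        value = [value]
    return [v for v in value if isinstance(v, str) and v.strip()]


def iter_rows(data):
    """Yield (database, table, user, description) for every policy item."""
    for policy in data.get("policies", []):
        desc = policy.get("description", "")
        resources = policy.get("resources") or {}
        # .get() with a fallback avoids the KeyError when 'database' or
        # 'table' is absent from a policy's resources.
        dbs = as_list((resources.get("database") or {}).get("values"))
        # Assumption: a policy without tables still gets one row per user,
        # with an empty table cell.
        tables = as_list((resources.get("table") or {}).get("values")) or [""]
        for item in policy.get("policyItems", []):
            for user in as_list(item.get("users")):
                for db in dbs:
                    for table in tables:
                        yield (db, table, user, desc)


def write_csv(rows, fd):
    """Write a header plus the given rows to an open text file object."""
    wr = csv.writer(fd)
    wr.writerow(("Database name", "Table", "Users", "Description"))
    wr.writerows(rows)


# Invented sample data covering the awkward cases from the comments.
sample = {
    "policies": [
        {
            "description": "demo policy",
            "resources": {
                "database": {"values": ["m2_db"]},
                "table": {"values": ["t1", "t2"]},
            },
            "policyItems": [
                {"users": ["hive"]},
                {"users": []},
                {"users": ["hdfs", " "]},
            ],
        },
        {
            "description": "no table",
            "resources": {"database": {"values": ["other_db"]}},
            "policyItems": [{"users": ["af34"]}],
        },
    ]
}

rows = list(iter_rows(sample))
buffer = io.StringIO()
write_csv(rows, buffer)
csv_text = buffer.getvalue()
```

To write a real file, the same write_csv call can be given a handle from open("outputfile.csv", "w", newline='') instead of the in-memory buffer. Unlike the answer's loop, this sketch does not filter out '*'; that check can be added inside as_list or iter_rows if wildcards should be dropped.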

gold_cy · Accepted Answer · 2019-02-12 18:46:13Z

1

Here is one way to do it, assuming your JSON data is in a variable called json_data:

from itertools import product

import pandas as pd

def make_dfs(data):
    cols = ['db_name', 'user', 'description']

    for item in data.get('policies'):
        description = item.get('description')
        # Only the first policyItems entry is read; missing keys fall back to [None].
        users = item.get('policyItems', [{}])[0].get('users', [None])
        db_name = item.get('resources', {}).get('database', {}).get('values', [None])
        # Drop the wildcard database entry.
        db_name = [name for name in db_name if name != '*']
        # One row per (database, user) pair, all sharing the description.
        prods = product(db_name, users, [description])
        yield pd.DataFrame.from_records(prods, columns=cols)

df = pd.concat(make_dfs(json_data), ignore_index=True)

print(df)

   db_name          user                               description
0    m2_db          hive  Policy for all - database, table, column
1    m2_db  rangerlookup  Policy for all - database, table, column
2    m2_db     ambari-qa  Policy for all - database, table, column
3    m2_db          af34  Policy for all - database, table, column
4    m2_db          g748  Policy for all - database, table, column
5    m2_db          hdfs  Policy for all - database, table, column
6    m2_db          dh10  Policy for all - database, table, column
7    m2_db          gs22  Policy for all - database, table, column
8    m2_db          dh27  Policy for all - database, table, column
9    m2_db          ct52  Policy for all - database, table, column
10   m2_db  livy_pyspark  Policy for all - database, table, column

Tested on Python 3.5.1 and pandas==0.23.4
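Note that make_dfs reads users only from the first policyItems entry of each policy, so users listed in later entries would be skipped. A possible variation, sketched here without pandas so it runs on its own, gathers users from every entry before building the product. The function name policy_rows and the sample policy are hypothetical, and dropping duplicate users is an assumption about what is wanted.

```python
from itertools import chain, product


def policy_rows(policy):
    """Return (db_name, user, description) tuples for a single policy."""
    # Flatten the users of every policyItems entry, not only the first one.
    users = chain.from_iterable(
        item.get("users", []) for item in policy.get("policyItems", [])
    )
    # dict.fromkeys drops duplicates while keeping first-seen order (Python 3.7+).
    users = list(dict.fromkeys(users))
    db_names = policy.get("resources", {}).get("database", {}).get("values", [])
    # Drop the wildcard entry, as in the answer above.
    db_names = [name for name in db_names if name != "*"]
    return list(product(db_names, users, [policy.get("description")]))


# Invented policy whose users are spread over two policyItems entries.
sample_policy = {
    "description": "Policy for all",
    "resources": {"database": {"values": ["m2_db", "*"]}},
    "policyItems": [
        {"users": ["hive", "hdfs"]},
        {"users": ["dh10", "hive"]},
    ],
}

result = policy_rows(sample_policy)
```

The returned tuples have the same column order as cols in make_dfs, so they could be passed to pd.DataFrame.from_records in the same way.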

answered Feb 12, 2019 at 18:46

gold_cy

6 Comments

Jaz Over a year ago

Running into an invalid syntax error on the first db_name line after users. This does help me understand how to map this better, though

Jaz Over a year ago

It points to name and says "invalid syntax" on this line: db_name = item.get('resources', {}).get('database', {}).get('values', [None])

gold_cy Over a year ago

might be the way you copied it over, I have no issues running this code

Jaz Over a year ago

That's strange, I even created a new package and tried running it.

gold_cy Over a year ago

check your indentation and make sure tabs and spaces are consistent. there are no syntax errors here that I can see
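One way to act on that advice is to scan the pasted source for lines indented with tabs, with spaces, or with a mix of both. The helper below is a hypothetical sketch (indent_report is not part of any library), and whether indentation was really the cause of the error in this thread is not established. The standard library also ships a checker that can be run as python -m tabnanny yourfile.py.

```python
def indent_report(source):
    """Map each indentation style to the 1-based line numbers that use it."""
    report = {"tabs": [], "spaces": [], "mixed": []}
    for number, line in enumerate(source.splitlines(), start=1):
        # Leading whitespace is whatever lstrip removes from the front.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if not indent or not line.strip():
            continue  # unindented or blank lines carry no information
        if " " in indent and "\t" in indent:
            report["mixed"].append(number)
        elif "\t" in indent:
            report["tabs"].append(number)
        else:
            report["spaces"].append(number)
    return report


# Invented snippet: line 2 uses spaces, line 3 a tab, line 4 both.
snippet = "def f():\n    a = 1\n\tb = 2\n \tc = 3\n"
report = indent_report(snippet)
```

If the report shows entries under more than one style, converting the whole file to spaces in the editor is usually the simplest fix.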
