
I have a large JSON file (400k lines). I am trying to isolate the following:

Policies- "description"

policy items - "users" and "database values"

JSON FILE - https://pastebin.com/hv8mLfgx

Expected Output from Pandas: https://i.sstatic.net/enkx6.jpg

Everything after "Policy Items" repeats with the same structure throughout the file. I tried the code below to isolate "users", but it doesn't work. My goal is to dump all of this into a CSV.

Edit: here is a solution I attempted, but I could not get it to work correctly: "Deeply nested JSON response to pandas dataframe"

from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    for item in jsonDF['policies'][0]['policyItems'][0]:
        print ('{} - {} - {}'.format(jsonDF['users']))

EDIT 2: I have some working code which is able to grab some of the USERS, but it does not grab all of them. Only 11 out of 25.

from pandas.io.json import json_normalize as Jnormal
import json
import pprint, csv
import re

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
    pNode = Jnormal(jsonDF['policies'][0]['policyItems'], record_path='users')
    print(pNode.head(500))

EDIT 3: This is the final working copy; however, it still does not copy over all of my TABLE data. I set up the loops to skip any filtering and capture everything, so I can sort it in Excel afterwards. Does anyone have any idea why I cannot capture all the TABLE values?

import csv
import json

with open("Ranger_Policies_20190204_195010.json") as file:
    json_data = json.load(file)
    with open("test.csv", 'w', newline='') as fd:
        wr = csv.writer(fd)
        wr.writerow(('Database name', 'Users', 'Description', 'Table'))
        for policy in json_data['policies']:
            desc = policy['description']
            db_values = policy['resources']['database']['values']
            db_tables = policy['resources']['table']['values']
            for item in policy['policyItems']:
                users = item['users']
                for dbT in db_tables:
                    for user in users:
                        for db in db_values:
                            _ = wr.writerow((db, user, desc, dbT))

  • I've thought about building some recursion into this with mapping. I personally don't even know where to start since I am a Beginner in Python. Any advice or direction would be appreciated. Commented Feb 12, 2019 at 18:07
  • so you want something that maps a description to the users? and policies is just a big list and you want to perform that operation on every dictionary within that list? Commented Feb 12, 2019 at 18:17
  • @aws_apprentice Yeah that's exactly it. The Description is actually a "Database" description. My goal is to map description to Database, then list Users under said database. Sorry for the initial confusion Commented Feb 12, 2019 at 18:19
  • can you please show a small example of an expected output? thanks Commented Feb 12, 2019 at 18:20
  • @ChrisLarson I understand that. This is a snippet of the larger JSON file. I didn't notice the typo at the bottom when I cut it out. Commented Feb 12, 2019 at 18:35

2 Answers

2

Pandas is overkill here: the standard csv module is enough. You just have to iterate over policies to extract the description and database values, then over policyItems to extract the users:

import csv
import json

with open("Ranger_Policies_20190204_195010.json") as file:
    jsonDF = json.load(file)
# open the output in write mode; newline='' lets the csv module handle line endings
with open("outputfile.csv", 'w', newline='') as fd:
    wr = csv.writer(fd)
    _ = wr.writerow(('Database name', 'Users', 'Description'))
    for policy in jsonDF['policies']:
        desc = policy['description']
        db_values = policy['resources']['database']['values']
        for item in policy['policyItems']:
            users = item['users']
            for user in users:
                for db in db_values:
                    # skip the wildcard entry that matches every database
                    if db != '*':
                        _ = wr.writerow((db, user, desc))
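
If some policies lack a 'table' entry under resources, or carry empty or blank user lists, direct indexing raises a KeyError or writes useless rows. The following is only a sketch of one way to guard against that with dict.get() and defaults. The function name iter_rows and the sample data are hypothetical, and the assumption that a missing table should become an empty string is mine, not something the JSON file dictates.

```python
def iter_rows(data):
    """Yield (database, table, user, description) tuples from parsed policy JSON."""
    for policy in data.get('policies', []):
        desc = policy.get('description', '')
        resources = policy.get('resources', {})
        # .get() with defaults avoids a KeyError when 'database' or 'table' is absent;
        # "or ['']" (an assumption) keeps one placeholder value so users are still listed
        dbs = resources.get('database', {}).get('values') or ['']
        tables = resources.get('table', {}).get('values') or ['']
        for item in policy.get('policyItems', []):
            # drop users that are empty strings or whitespace only
            users = [u for u in item.get('users', []) if u and u.strip()]
            for db in dbs:
                for table in tables:
                    for user in users:
                        yield (db, table, user, desc)


# hypothetical sample: no 'table' key, one blank user, one empty users list
sample = {
    'policies': [
        {
            'description': 'demo policy',
            'resources': {'database': {'values': ['db1']}},
            'policyItems': [{'users': ['alice', ' ']}, {'users': []}, {'users': ['bob']}],
        }
    ]
}

rows = list(iter_rows(sample))
for row in rows:
    print(row)
```

Each yielded tuple can be passed straight to wr.writerow() in the loop above.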

4 Comments

This works! It requires some modifications, since it's not capturing everything. I'll need to understand the logic better to modify it as needed. Thanks, Serge.
I added a line after the database values: db_tables = policy['resources']['table']['values']. This gives me a KeyError, even though 'table' is within the resources class. Not sure why?
Found the error: it was poorly formatted JSON. We also have multiple tables inside the resources class. I'm trying to work that into your loop but running into difficulty.
Found an issue with this messy JSON: sometimes users or database values can be [] or " ". I'll need to code some logic around that. I appreciate you laying the foundation; working out the logic is proving to be a bit difficult.
1

Here is one way to do it, and let's assume your json data is in a variable called json_data

from itertools import product

import pandas as pd

def make_dfs(data):
    cols = ['db_name', 'user', 'description']

    for item in data.get('policies'):
        description = item.get('description')
        users = item.get('policyItems', [{}])[0].get('users', [None])
        db_name = item.get('resources', {}).get('database', {}).get('values', [None])
        db_name = [name for name in db_name if name != '*']
        prods = product(db_name, users, [description])
        yield pd.DataFrame.from_records(prods, columns=cols)

df = pd.concat(make_dfs(json_data), ignore_index=True)

print(df)

   db_name          user                               description
0    m2_db          hive  Policy for all - database, table, column
1    m2_db  rangerlookup  Policy for all - database, table, column
2    m2_db     ambari-qa  Policy for all - database, table, column
3    m2_db          af34  Policy for all - database, table, column
4    m2_db          g748  Policy for all - database, table, column
5    m2_db          hdfs  Policy for all - database, table, column
6    m2_db          dh10  Policy for all - database, table, column
7    m2_db          gs22  Policy for all - database, table, column
8    m2_db          dh27  Policy for all - database, table, column
9    m2_db          ct52  Policy for all - database, table, column
10   m2_db  livy_pyspark  Policy for all - database, table, column

Tested on Python 3.5.1 and pandas==0.23.4
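
Note that make_dfs reads only the first entry of policyItems, so users listed in later items of the same policy are not included. If that matters for your file, one possible variation is to gather users across every item first. This is a sketch only: the helper name all_users and the sample policy are hypothetical, and it assumes duplicate user names should be collapsed.

```python
from itertools import chain


def all_users(policy):
    """Collect users from every entry in policyItems, not just the first."""
    items = policy.get('policyItems') or []
    users = chain.from_iterable(item.get('users') or [] for item in items)
    # dict.fromkeys drops duplicates and keeps first-seen order (guaranteed on Python 3.7+)
    return list(dict.fromkeys(users))


# hypothetical policy with three items, one of them without a 'users' key
policy = {'policyItems': [{'users': ['hive', 'hdfs']}, {'users': ['hdfs', 'af34']}, {}]}
print(all_users(policy))  # ['hive', 'hdfs', 'af34']
```

The result could replace the users line in make_dfs, e.g. users = all_users(item) or [None].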

6 Comments

Running into a invalid syntax error on the first line of db_name after users. This does help though to understand how to map this better
points to name and says "invalid syntax" on this line db_name = item.get('resources', {}).get('database', {}).get('values', [None])
might be the way you copied it over, I have no issues running this code
That's strange, I even created a new package and tried running it.
check your indentation and make sure tabs and spaces is consistent. there are no syntax errors here that I can see