I have a dataset containing nested JSON objects. I want to extract information from this nested JSON and put it in a DataFrame in Python. I have used the json_normalize method, but I am unable to parse beyond a certain level. Kindly help. Thank you.

  • Can you elaborate? Can you give a sample of the data? Commented Sep 8, 2021 at 12:48
  • What should the DF look like? Which fields of the JSON should be in the DF? Commented Sep 8, 2021 at 13:07

2 Answers

I have been working on a function that expands all embedded lists and dictionaries.

import json
from pathlib import Path

import pandas as pd

# read the raw JSON text from a file
with open(Path.home().joinpath("Downloads").joinpath("Sample Json.txt")) as f:
    js = f.read()

def normalize(js, expand_all=False):
    # accept either a JSON string or already-parsed data
    df = pd.json_normalize(json.loads(js) if isinstance(js, str) else js)
    # get first column that contains lists
    # (pandas 2.1+ deprecates applymap in favor of DataFrame.map)
    col = df.applymap(type).astype(str).eq("<class 'list'>").all().idxmax()
    # explode list and expand embedded dictionaries
    df = df.explode(col).reset_index(drop=True)
    df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")
    # any dictionary to expand?
    if df.applymap(type).astype(str).eq("<class 'dict'>").any().any():
        col = df.applymap(type).astype(str).eq("<class 'dict'>").all().idxmax()
        df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")

    # any lists left? keep normalizing until none remain
    while expand_all and df.applymap(type).astype(str).eq("<class 'list'>").any().any():
        df = normalize(df.to_dict("records"))
    return df

df = normalize(js, expand_all=True)

cfs ctin fldtr1 cfs3b flprdr1 dtcancel val inv_typ pos idt rchrg inum chksum num csamt samt rt txval camt iamt
0 Y 03AZX 10-Aug-20 Y Jul-20 nan 2390 R 03 27-07-2020 N TI/20-21/111 24ea1a46933dd7c6f130cc7ddce3ad89f42194d84e358746f66716d0f1b8aef0 101 0 182.25 18 2025 182.25 0
1 Y 03AZY 02-Sep-20 Y Jul-20 nan 10756 R 03 20-07-2020 N 70 164777293c8ce80595cd4803c3d0287bc544772fb9e5331602ed3d7d0534e82f 1801 0 820.35 18 9115 820.35 nan
2 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 411.82 R 03 01-07-2020 N 18IPB06013580804 0560d2b220de53f458ac65594f50bfa5ba736f95061c88201d91371fbeccabf8 1 0 31.41 18 349 31.41 nan
3 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 411.82 R 03 01-07-2020 N 18IPB06013580805 08ae71bcb591723318796e797da586ef9b8e5b6b920e9877be6afc9223486760 1 0 31.41 18 349 31.41 nan
4 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 383.5 R 03 01-07-2020 N 18IPB06013580806 4d22ddd1d05d22cc4707a89dd80e76a271b99a7ba2610e3b111489fd4f7950fc 1 0 29.25 18 325 29.25 nan
5 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 496.78 R 03 01-07-2020 N 18IPB06013580807 73e6e787493276151783d5ab1107bd0bac53780a5840964f7953bf3ba8a4efb0 1 0 37.89 18 421 37.89 nan
6 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 411.82 R 03 21-07-2020 N 18IPB07013893564 52ef0e7269de052c0353580cad5092ff1cc7a3c454318b2df1041a62a32f033f 1 0 31.41 18 349 31.41 nan
7 Y 03A00P1Z7 10-Aug-20 Y Jul-20 nan 411.82 R 03 21-07-2020 N 18IPB07013893565 ab44c119f3db614dccfd3bc63c036eaca22a41c99e3e5090904e38aee056f4ac 1 0 31.41 18 349 31.41 nan
8 Y 03CAZD 10-Aug-20 Y Jul-20 nan 162840 R 03 13-07-2020 N T/20-21/56 92e52e48e812bb0bb2e34d9e400248730fdc40363459d05c4e9d6ebb7fe6165d 101 0 12420 18 138000 12420 0
9 Y 03AAE 22-Aug-20 Y Jul-20 nan 46556 R 03 30-07-2020 N S20/21-359 8138e35895114ae412e8256f3ce8382cdd8ae771f2780781085134618bb033c9 1801 0 3550.87 18 39454.2 3550.87 0
10 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 8417.98 R 03 02-07-2020 N 0000030301011976 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb 1 0 0 0 1024.84 0 nan
11 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 8417.98 R 03 02-07-2020 N 0000030301011976 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb 2 0 233.58 18 2595.37 233.58 nan
12 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 8417.98 R 03 02-07-2020 N 0000030301011976 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb 3 0 89.34 5 3573.99 89.34 nan
13 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 8417.98 R 03 02-07-2020 N 0000030301011976 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb 4 0 30.96 12 516.02 30.96 nan
14 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 2824.88 R 03 06-07-2020 N 0000030301012348 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 1 0 116.46 18 1293.94 116.46 nan
15 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 2824.88 R 03 06-07-2020 N 0000030301012348 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 2 0 37.27 12 621.18 37.27 nan
16 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 2824.88 R 03 06-07-2020 N 0000030301012348 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 3 0 0 0 85.26 0 nan
17 Y 03AAD1ZA 11-Aug-20 Y Jul-20 nan 2824.88 R 03 06-07-2020 N 0000030301012348 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 4 0 12.31 5 492.42 12.31 nan
18 Y 03AA1ZQ 17-Aug-20 Y Jul-20 nan 39294 R 03 02-07-2020 N TI/20-21/43 69f7931986ad9274d9595ca5221e3ce82aa389d659e83376ff1ec34571057670 101 0 2997 18 33300 2997 0
19 Y 03AGG3Z5 18-Aug-20 Y Jul-20 22-Jan-20 593583 R 03 31-07-2020 N 25 623dcb5b65e34be4d0453c1783915bb8e66684a2e33a3c8a547e38754c4f1af9 1 0 45273.3 18 503036 45273.3 nan
20 Y 03AGG3Z5 18-Aug-20 Y Jul-20 22-Jan-20 601409 R 03 31-07-2020 N 26 ef8b99f99fe090f0a2374d8d6c0b15c265740e6c6487ff68d510382ec21d8ce4 1 0 45870.2 18 509668 45870.2 nan
21 Y 03AGG3Z5 18-Aug-20 Y Jul-20 22-Jan-20 767358 R 03 31-07-2020 N 27 9c1257eddeb8cdc7e6a832a3646969b71e49eeeb7d6742b26cfc6e0e3630438a 1 0 58527.3 18 650303 58527.3 nan
22 Y 03AGG3Z5 18-Aug-20 Y Jul-20 22-Jan-20 597886 R 03 31-07-2020 N 28 29fc1b28aedd1545e7ea0fd8b67b8332a83f1ac3f62af9398af2dfa26c9f1d90 1 0 45601.4 18 506683 45601.4 nan
23 Y 03AA9 18-Aug-20 Y Jul-20 nan 41914 R 03 29-07-2020 N 2020-21/K-916 d112ad384eb291d49509bdf4a005d509424fefee4caf3443bc9726cf41665295 1801 0 3196.8 18 35520 3196.8 nan
24 Y 03A1Z8 12-Aug-20 Y Jul-20 nan 274893 R 03 20-07-2020 N T/20-21/10 e5851fcc6b370714d7523080582a678a212f5dde90f5c2618880376018221f38 101 0 20966.4 18 232960 20966.4 0
25 Y 03AD1ZL 11-Aug-20 Y Jul-20 nan 125375 R 03 03-07-2020 N T/20-21/155 2bb398c7a0fedf11f1f1c1d196c43ad79910be52e6892f88915671025528eb2b 101 0 9562.5 18 106250 9562.5 0
26 Y 03AA3Z9 14-Aug-20 Y Jul-20 nan 529.99 R 03 31-07-2020 N 0301072000000650 ad1e1d1572c9058fabd6d23fb5dc4b68f1a2a10d3dd3d7e73d73d3c502d92151 1 nan 40.42 18 449.15 40.42 nan
27 Y 03AA3Z9 14-Aug-20 Y Jul-20 nan 1201 R 03 31-07-2020 N 0303072000000025 5a69229d907957c1d95eb464684891c202102b8589f5603b8ae14b07607f1655 1 nan 91.5 18 1018 91.5 nan
28 Y 03AB1ZV 11-Aug-20 Y Jul-20 nan 30976 R 03 10-07-2020 N 70 69bbeb088634a88b30c6e6046b63b1977f5534b2f676b984ef78f2c3bad8ca35 1800 nan 2362.5 18 26250 2362.5 nan
29 Y 03AD1Z1 13-Aug-20 Y Jul-20 nan 8968 R 03 01-07-2020 N B25 5b98b819ca14a377c9304e7eab21957152c4819e82e37f2619fb2c547fb84ba6 1801 0 684 18 7600 684 nan
30 Y 03AAO 10-Aug-20 Y Jul-20 nan 38940 R 03 13-07-2020 N TI/20-21/30 bae339e580c2ab9ffee90533650e4e2acdc47310230ed54aabbb96f89d3fc7c4 101 0 2970 18 33000 2970 0
31 Y 07AH1ZU 11-Aug-20 Y Jul-20 nan 13836.5 R 03 31-07-2020 N DELR/EXP/12176 cb34f329adcd88c9e8794db9892fe47bd0a7afc0373a20860de046934f7923fa 1 0 nan 18 11725.9 nan 2110.65
32 Y 03A1ZT 18-Aug-20 Y Jul-20 nan 41820 R 03 07-07-2020 N TI/20-21/68 ad61c4dd8227b214dbe4bba24b57a2c976ce8438e53cf15b3530480116ca64da 101 0 3189.69 18 35441 3189.69 0
33 Y 03A1ZT 18-Aug-20 Y Jul-20 nan 69773 R 03 10-07-2020 N TI/20-21/71 1deca4741b91716bfabc8b2ab826be76342b0fd3e698b128c927f4b426c064d0 101 0 5321.7 18 59130 5321.7 0
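
When the nesting is known in advance, pd.json_normalize can also reach the deeper levels directly through its record_path and meta parameters. The following is only a sketch: the sample data and its nesting are hypothetical, with field names borrowed from the output above, and it is not the structure of the original file.

```python
import pandas as pd

# Hypothetical data: the nesting (inv -> itms) is assumed for illustration only.
data = [
    {
        "ctin": "X1",
        "inv": [
            {"inum": "A1", "itms": [{"num": 1, "rt": 18}, {"num": 2, "rt": 5}]},
        ],
    },
]

# record_path walks down to the innermost list of records;
# meta pulls fields from the outer levels onto every resulting row.
df = pd.json_normalize(
    data,
    record_path=["inv", "itms"],
    meta=["ctin", ["inv", "inum"]],
)
print(df)
```

Each element of the innermost list becomes one row, and the outer fields are repeated on every row. The general normalize function above is more convenient when the structure is not known beforehand.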

2 Comments

Thank you so much. The result is as expected. May I ask one more thing: if we have to import the JSON from a URL, what modification do we need to make to the code above? Regards.
Just pass the JSON to the function; it accepts either a string or a dictionary. Something like normalize(requests.get("http://someservice.local").json(), expand_all=True) would work.
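
As a sketch of the URL case from the comment above, the JSON can also be fetched with the standard library alone instead of requests. The helper name fetch_json and its timeout default are assumptions made for illustration, and the URL in the usage comment is the placeholder from the comment, not a real service.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download a URL and parse the response body as JSON (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage with the normalize function from the answer (placeholder URL):
# df = normalize(fetch_json("http://someservice.local"), expand_all=True)
```

The parsed result is a plain dictionary or list, which normalize accepts directly, so the function itself should need no changes.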

To flatten a nested JSON file, you can use the following function:

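def flatten_json(nested_json):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            # descend into each key, extending the prefix with the key name
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            # descend into each element, extending the prefix with its index
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # leaf value: store it under the prefix without the trailing '_'
            out[name[:-1]] = x

    flatten(nested_json)
    return out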

Assuming your parsed JSON is stored in a variable called myjson:

import pandas as pd

df = pd.Series(flatten_json(myjson)).to_frame()
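
For a list of records, the same flattening can be applied to each record in turn. The sketch below repeats the function so it runs on its own; the sample records and their nesting are hypothetical and only illustrate how the flattened keys are built.

```python
def flatten_json(nested_json):
    """Same idea as the function above, repeated so this block is self-contained."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key, value in x.items():
                flatten(value, name + key + "_")
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + "_")
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out


# Hypothetical records; the nesting is assumed for illustration only.
records = [
    {"ctin": "X1", "inv": [{"inum": "A1", "itms": [{"num": 1, "rt": 18}]}]},
    {"ctin": "X2", "inv": [{"inum": "B7", "itms": [{"num": 1, "rt": 5}]}]},
]

# One flat dictionary per record, with keys such as "inv_0_itms_0_rt".
rows = [flatten_json(record) for record in records]
print(rows[0])
```

Passing rows to pd.DataFrame would then give one row per record and one column per flattened key, instead of the single column produced by to_frame().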
