I have a dataset containing nested JSON objects. I want to extract information from this nested JSON and put it in a pandas DataFrame in Python. I have used the json_normalize method, but I am unable to parse beyond a certain level of nesting. Kindly help. Thank you.
- Can you elaborate? Can you give a sample of the data? – WArnold, Sep 8, 2021 at 12:48
- What should the DataFrame look like? Which fields of the JSON should be in it? – balderman, Sep 8, 2021 at 13:07
2 Answers
I have been working on a function that expands all embedded lists and dictionaries.
import json
from pathlib import Path

import pandas as pd

# read the raw JSON text from the sample file
with open(Path.home().joinpath("Downloads").joinpath("Sample Json.txt")) as f:
    js = f.read()


def normalize(js, expand_all=False):
    # accept either a JSON string or already-parsed records
    df = pd.json_normalize(json.loads(js) if isinstance(js, str) else js)
    # get first column that contains lists
    # (applymap is deprecated since pandas 2.1 in favour of DataFrame.map)
    col = df.applymap(type).astype(str).eq("<class 'list'>").all().idxmax()
    # explode list and expand embedded dictionaries
    df = df.explode(col).reset_index(drop=True)
    df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")
    # any dictionary to expand?
    if df.applymap(type).astype(str).eq("<class 'dict'>").any().any():
        col = df.applymap(type).astype(str).eq("<class 'dict'>").all().idxmax()
        df = df.drop(columns=[col]).join(df[col].apply(pd.Series), rsuffix=f".{col}")
    # any lists left?
    while expand_all and df.applymap(type).astype(str).eq("<class 'list'>").any().any():
        df = normalize(df.to_dict("records"))
    return df


df = normalize(js, expand_all=True)
| | cfs | ctin | fldtr1 | cfs3b | flprdr1 | dtcancel | val | inv_typ | pos | idt | rchrg | inum | chksum | num | csamt | samt | rt | txval | camt | iamt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Y | 03AZX | 10-Aug-20 | Y | Jul-20 | nan | 2390 | R | 03 | 27-07-2020 | N | TI/20-21/111 | 24ea1a46933dd7c6f130cc7ddce3ad89f42194d84e358746f66716d0f1b8aef0 | 101 | 0 | 182.25 | 18 | 2025 | 182.25 | 0 |
| 1 | Y | 03AZY | 02-Sep-20 | Y | Jul-20 | nan | 10756 | R | 03 | 20-07-2020 | N | 70 | 164777293c8ce80595cd4803c3d0287bc544772fb9e5331602ed3d7d0534e82f | 1801 | 0 | 820.35 | 18 | 9115 | 820.35 | nan |
| 2 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 411.82 | R | 03 | 01-07-2020 | N | 18IPB06013580804 | 0560d2b220de53f458ac65594f50bfa5ba736f95061c88201d91371fbeccabf8 | 1 | 0 | 31.41 | 18 | 349 | 31.41 | nan |
| 3 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 411.82 | R | 03 | 01-07-2020 | N | 18IPB06013580805 | 08ae71bcb591723318796e797da586ef9b8e5b6b920e9877be6afc9223486760 | 1 | 0 | 31.41 | 18 | 349 | 31.41 | nan |
| 4 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 383.5 | R | 03 | 01-07-2020 | N | 18IPB06013580806 | 4d22ddd1d05d22cc4707a89dd80e76a271b99a7ba2610e3b111489fd4f7950fc | 1 | 0 | 29.25 | 18 | 325 | 29.25 | nan |
| 5 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 496.78 | R | 03 | 01-07-2020 | N | 18IPB06013580807 | 73e6e787493276151783d5ab1107bd0bac53780a5840964f7953bf3ba8a4efb0 | 1 | 0 | 37.89 | 18 | 421 | 37.89 | nan |
| 6 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 411.82 | R | 03 | 21-07-2020 | N | 18IPB07013893564 | 52ef0e7269de052c0353580cad5092ff1cc7a3c454318b2df1041a62a32f033f | 1 | 0 | 31.41 | 18 | 349 | 31.41 | nan |
| 7 | Y | 03A00P1Z7 | 10-Aug-20 | Y | Jul-20 | nan | 411.82 | R | 03 | 21-07-2020 | N | 18IPB07013893565 | ab44c119f3db614dccfd3bc63c036eaca22a41c99e3e5090904e38aee056f4ac | 1 | 0 | 31.41 | 18 | 349 | 31.41 | nan |
| 8 | Y | 03CAZD | 10-Aug-20 | Y | Jul-20 | nan | 162840 | R | 03 | 13-07-2020 | N | T/20-21/56 | 92e52e48e812bb0bb2e34d9e400248730fdc40363459d05c4e9d6ebb7fe6165d | 101 | 0 | 12420 | 18 | 138000 | 12420 | 0 |
| 9 | Y | 03AAE | 22-Aug-20 | Y | Jul-20 | nan | 46556 | R | 03 | 30-07-2020 | N | S20/21-359 | 8138e35895114ae412e8256f3ce8382cdd8ae771f2780781085134618bb033c9 | 1801 | 0 | 3550.87 | 18 | 39454.2 | 3550.87 | 0 |
| 10 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 8417.98 | R | 03 | 02-07-2020 | N | 0000030301011976 | 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb | 1 | 0 | 0 | 0 | 1024.84 | 0 | nan |
| 11 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 8417.98 | R | 03 | 02-07-2020 | N | 0000030301011976 | 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb | 2 | 0 | 233.58 | 18 | 2595.37 | 233.58 | nan |
| 12 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 8417.98 | R | 03 | 02-07-2020 | N | 0000030301011976 | 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb | 3 | 0 | 89.34 | 5 | 3573.99 | 89.34 | nan |
| 13 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 8417.98 | R | 03 | 02-07-2020 | N | 0000030301011976 | 70d17e281b22541b3d41eb3269d057b73140c203771365a892dd496ffc756adb | 4 | 0 | 30.96 | 12 | 516.02 | 30.96 | nan |
| 14 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 2824.88 | R | 03 | 06-07-2020 | N | 0000030301012348 | 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 | 1 | 0 | 116.46 | 18 | 1293.94 | 116.46 | nan |
| 15 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 2824.88 | R | 03 | 06-07-2020 | N | 0000030301012348 | 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 | 2 | 0 | 37.27 | 12 | 621.18 | 37.27 | nan |
| 16 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 2824.88 | R | 03 | 06-07-2020 | N | 0000030301012348 | 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 | 3 | 0 | 0 | 0 | 85.26 | 0 | nan |
| 17 | Y | 03AAD1ZA | 11-Aug-20 | Y | Jul-20 | nan | 2824.88 | R | 03 | 06-07-2020 | N | 0000030301012348 | 2e7978264e42a74a70aa35d39ca6856f4dfb333e76935667a8de2733f888a1f1 | 4 | 0 | 12.31 | 5 | 492.42 | 12.31 | nan |
| 18 | Y | 03AA1ZQ | 17-Aug-20 | Y | Jul-20 | nan | 39294 | R | 03 | 02-07-2020 | N | TI/20-21/43 | 69f7931986ad9274d9595ca5221e3ce82aa389d659e83376ff1ec34571057670 | 101 | 0 | 2997 | 18 | 33300 | 2997 | 0 |
| 19 | Y | 03AGG3Z5 | 18-Aug-20 | Y | Jul-20 | 22-Jan-20 | 593583 | R | 03 | 31-07-2020 | N | 25 | 623dcb5b65e34be4d0453c1783915bb8e66684a2e33a3c8a547e38754c4f1af9 | 1 | 0 | 45273.3 | 18 | 503036 | 45273.3 | nan |
| 20 | Y | 03AGG3Z5 | 18-Aug-20 | Y | Jul-20 | 22-Jan-20 | 601409 | R | 03 | 31-07-2020 | N | 26 | ef8b99f99fe090f0a2374d8d6c0b15c265740e6c6487ff68d510382ec21d8ce4 | 1 | 0 | 45870.2 | 18 | 509668 | 45870.2 | nan |
| 21 | Y | 03AGG3Z5 | 18-Aug-20 | Y | Jul-20 | 22-Jan-20 | 767358 | R | 03 | 31-07-2020 | N | 27 | 9c1257eddeb8cdc7e6a832a3646969b71e49eeeb7d6742b26cfc6e0e3630438a | 1 | 0 | 58527.3 | 18 | 650303 | 58527.3 | nan |
| 22 | Y | 03AGG3Z5 | 18-Aug-20 | Y | Jul-20 | 22-Jan-20 | 597886 | R | 03 | 31-07-2020 | N | 28 | 29fc1b28aedd1545e7ea0fd8b67b8332a83f1ac3f62af9398af2dfa26c9f1d90 | 1 | 0 | 45601.4 | 18 | 506683 | 45601.4 | nan |
| 23 | Y | 03AA9 | 18-Aug-20 | Y | Jul-20 | nan | 41914 | R | 03 | 29-07-2020 | N | 2020-21/K-916 | d112ad384eb291d49509bdf4a005d509424fefee4caf3443bc9726cf41665295 | 1801 | 0 | 3196.8 | 18 | 35520 | 3196.8 | nan |
| 24 | Y | 03A1Z8 | 12-Aug-20 | Y | Jul-20 | nan | 274893 | R | 03 | 20-07-2020 | N | T/20-21/10 | e5851fcc6b370714d7523080582a678a212f5dde90f5c2618880376018221f38 | 101 | 0 | 20966.4 | 18 | 232960 | 20966.4 | 0 |
| 25 | Y | 03AD1ZL | 11-Aug-20 | Y | Jul-20 | nan | 125375 | R | 03 | 03-07-2020 | N | T/20-21/155 | 2bb398c7a0fedf11f1f1c1d196c43ad79910be52e6892f88915671025528eb2b | 101 | 0 | 9562.5 | 18 | 106250 | 9562.5 | 0 |
| 26 | Y | 03AA3Z9 | 14-Aug-20 | Y | Jul-20 | nan | 529.99 | R | 03 | 31-07-2020 | N | 0301072000000650 | ad1e1d1572c9058fabd6d23fb5dc4b68f1a2a10d3dd3d7e73d73d3c502d92151 | 1 | nan | 40.42 | 18 | 449.15 | 40.42 | nan |
| 27 | Y | 03AA3Z9 | 14-Aug-20 | Y | Jul-20 | nan | 1201 | R | 03 | 31-07-2020 | N | 0303072000000025 | 5a69229d907957c1d95eb464684891c202102b8589f5603b8ae14b07607f1655 | 1 | nan | 91.5 | 18 | 1018 | 91.5 | nan |
| 28 | Y | 03AB1ZV | 11-Aug-20 | Y | Jul-20 | nan | 30976 | R | 03 | 10-07-2020 | N | 70 | 69bbeb088634a88b30c6e6046b63b1977f5534b2f676b984ef78f2c3bad8ca35 | 1800 | nan | 2362.5 | 18 | 26250 | 2362.5 | nan |
| 29 | Y | 03AD1Z1 | 13-Aug-20 | Y | Jul-20 | nan | 8968 | R | 03 | 01-07-2020 | N | B25 | 5b98b819ca14a377c9304e7eab21957152c4819e82e37f2619fb2c547fb84ba6 | 1801 | 0 | 684 | 18 | 7600 | 684 | nan |
| 30 | Y | 03AAO | 10-Aug-20 | Y | Jul-20 | nan | 38940 | R | 03 | 13-07-2020 | N | TI/20-21/30 | bae339e580c2ab9ffee90533650e4e2acdc47310230ed54aabbb96f89d3fc7c4 | 101 | 0 | 2970 | 18 | 33000 | 2970 | 0 |
| 31 | Y | 07AH1ZU | 11-Aug-20 | Y | Jul-20 | nan | 13836.5 | R | 03 | 31-07-2020 | N | DELR/EXP/12176 | cb34f329adcd88c9e8794db9892fe47bd0a7afc0373a20860de046934f7923fa | 1 | 0 | nan | 18 | 11725.9 | nan | 2110.65 |
| 32 | Y | 03A1ZT | 18-Aug-20 | Y | Jul-20 | nan | 41820 | R | 03 | 07-07-2020 | N | TI/20-21/68 | ad61c4dd8227b214dbe4bba24b57a2c976ce8438e53cf15b3530480116ca64da | 101 | 0 | 3189.69 | 18 | 35441 | 3189.69 | 0 |
| 33 | Y | 03A1ZT | 18-Aug-20 | Y | Jul-20 | nan | 69773 | R | 03 | 10-07-2020 | N | TI/20-21/71 | 1deca4741b91716bfabc8b2ab826be76342b0fd3e698b128c927f4b426c064d0 | 101 | 0 | 5321.7 | 18 | 59130 | 5321.7 | 0 |
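
To see what the explode-and-join steps inside normalize do, here is a minimal sketch on a tiny made-up input. The field names ctin, inv, inum and val are only borrowed from the output above for illustration; the real file's structure is an assumption here.

```python
import pandas as pd

# Hypothetical records: each one carries a list of embedded dictionaries.
records = [
    {"ctin": "A1", "inv": [{"inum": "1", "val": 10.0}, {"inum": "2", "val": 20.0}]},
    {"ctin": "B2", "inv": [{"inum": "3", "val": 30.0}]},
]

# json_normalize flattens nested dicts but leaves lists untouched in one column.
df = pd.json_normalize(records)

# explode() turns every list element into its own row.
exploded = df.explode("inv").reset_index(drop=True)

# Each cell of "inv" is now a dict; apply(pd.Series) spreads it into columns.
flat = exploded.drop(columns=["inv"]).join(exploded["inv"].apply(pd.Series))

print(flat)
```

The result has one row per invoice, with ctin repeated for every element of its list. The rsuffix argument in the function above only matters when an expanded key collides with an existing column name.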
2 Comments
Jodhvir Singh
Thank you so much. I see that the result is as expected. May I also ask one more thing: if we have to import the JSON from a URL, what modification do we need to make to the code above? Regards.
Rob Raymond
Just pass the JSON to the function; it accepts either a string or a dictionary. So something like
normalize(requests.get("http://someservice.local").json(), expand_all=True) would work.

To flatten a nested JSON file, you can use the following function:
def flatten_json(nested_json):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out
Assuming your parsed JSON is stored in a variable called myjson (and pandas is imported as pd):
df = pd.Series(flatten_json(myjson)).to_frame()
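
To make the key naming concrete, here is a small worked example of flatten_json on a made-up nested object. The sample data and its field names are hypothetical; the function is repeated only so the snippet runs on its own.

```python
def flatten_json(nested_json):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            # list elements are keyed by their position
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # strip the trailing underscore from the accumulated path
            out[name[:-1]] = x

    flatten(nested_json)
    return out


# Hypothetical sample with a dict inside a list inside a list.
sample = {
    "ctin": "X1",
    "inv": [
        {"inum": "A-1", "itms": [{"num": 1, "itm_det": {"rt": 18, "txval": 100.0}}]}
    ],
    "notes": [],
}

flat = flatten_json(sample)
for key, value in flat.items():
    print(key, "=", value)
# ctin = X1
# inv_0_inum = A-1
# inv_0_itms_0_num = 1
# inv_0_itms_0_itm_det_rt = 18
# inv_0_itms_0_itm_det_txval = 100.0
```

Each leaf value ends up under a key built from the path to it, with list positions as numbers. Note that empty lists and empty dictionaries (like notes here) produce no key at all, and that every list element becomes a separate column rather than a separate row, so this approach suits single records better than long lists.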