
I have about 5k JSON files, all structured similarly. I need to collect all the values of the "loc" key from all the files and store them in a separate JSON file (or two). The "loc" values across all files add up to 78 million. How can I get this done in the most optimized and fastest way?

Structure of content in all files looks like:

{
  "urlset": {
    "@xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "@xmlns:xhtml": "http://www.w3.org/1999/xhtml",
    "url": [
      {
        "loc": "https://www.example.com/a",
        "xhtml:link": {
          "@rel": "alternate",
          "@href": "android-app://com.example/xyz"
        },
        "lastmod": "2020-12-25",
        "priority": "0.8"
      },
      {
        "loc": "https://www.exampe.com/b",
        "xhtml:link": {
          "@rel": "alternate",
          "@href": "android-app://com.example/xyz"
        },
        "lastmod": "2020-12-25",
        "priority": "0.8"
      }
    ]
  }
}

I am looking for output json file like:

["https://www.example.com/a","https://www.example.com/b"]

What I am currently doing is:

import glob
import json

path = r'/home/spark/' # path to folder containing files
link_list = [] # list of required links
li = "" # contains text of all files combined

all_files = glob.glob(path + "/*")
# Looping through each file and concatenating everything into one string
for filename in all_files:
    with open(filename, "r") as f:
        li = li + f.read()

# Retrieving the link that follows every "loc" key
for k in range(78000000):
    li = li.split('"loc"', 1)[1]  # consume the text up to the next "loc"
    link = li.split('"', 2)[1]    # take the quoted value that follows it
    link_list.append(link)

with open("output.json", "w") as f:
    f.write(json.dumps(link_list))

I guess this is the worst solution anyone could come up with :D, so I need to optimize it to do the job fast and efficiently.

1 Answer

import json
import glob

dict_results = {}
dict_results['links'] = []

# Parse each JSON file and collect every "loc" value
for filename in glob.glob("*.json"):
    with open(filename, "r") as msg:
        data = json.load(msg)
    for url in data['urlset']['url']:
        dict_results['links'].append(url['loc'])

print(dict_results)

If you just want all the links, that should do it. Afterwards, write the result to a file, in text or binary mode, as you wish.

Output:

{'links': ['https://www.example.com/a', 'https://www.example.com/b']}
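
For the final write step, a minimal sketch using json.dump (the output.json filename is an assumption, pick whatever suits you):

import json

# Write the collected links to disk; "output.json" is an assumed name
with open("output.json", "w") as out:
    json.dump(dict_results, out)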

In case you just want a list (and so not a JSON object):

import json
import glob

list_results = []

# Parse each JSON file and collect every "loc" value into a flat list
for filename in glob.glob("*.json"):
    with open(filename, "r") as msg:
        data = json.load(msg)
    for url in data['urlset']['url']:
        list_results.append(url['loc'])

print(list_results)

Output:

['https://www.example.com/a', 'https://www.example.com/b']

If, as it seems, you are working with plain-text JSON files whose layout you know and trust, the fastest way would certainly be this one:

import glob

list_results = []

# Scan each file line by line instead of parsing the JSON
for filename in glob.glob("*.json"):
    with open(filename, "r") as msg:
        for line in msg:
            if '"loc"' in line:
                # On a pretty-printed line like  "loc": "https://...",
                # splitting on '"' leaves the URL at index 3
                list_results.append(line.split('"')[3])

print(list_results)
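
A note on scale: with 78 million links, building the whole list in memory before dumping it may be expensive. A hedged variant of the same line-scanning approach that streams each link straight into a JSON array on disk (the output.json filename is an assumption):

import glob

# Stream links directly into a JSON array on disk instead of
# holding 78 million strings in memory; "output.json" is an assumed name
with open("output.json", "w") as out:
    out.write("[")
    first = True
    for filename in glob.glob("*.json"):
        with open(filename, "r") as msg:
            for line in msg:
                if '"loc"' in line:
                    if not first:
                        out.write(",")
                    out.write('"' + line.split('"')[3] + '"')
                    first = False
    out.write("]")

This assumes the URLs contain no characters that need JSON escaping, which holds for typical sitemap loc values.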

1 Comment

Thank you so much! It worked like a charm. :D
