
I have about 5k JSON files, all structured similarly. I need to collect all the values of the "loc" key from all the files and store them in a separate JSON file (or two). The "loc" values across all files add up to 78 million. How can I get this done in the most optimized and fastest way?

Structure of content in all files looks like:

{
  "urlset": {
    "@xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "@xmlns:xhtml": "http://www.w3.org/1999/xhtml",
    "url": [
      {
        "loc": "https://www.example.com/a",
        "xhtml:link": {
          "@rel": "alternate",
          "@href": "android-app://com.example/xyz"
        },
        "lastmod": "2020-12-25",
        "priority": "0.8"
      },
      {
        "loc": "https://www.exampe.com/b",
        "xhtml:link": {
          "@rel": "alternate",
          "@href": "android-app://com.example/xyz"
        },
        "lastmod": "2020-12-25",
        "priority": "0.8"
      }
    ]
  }
}

I am looking for output json file like:

["https://www.example.com/a","https://www.example.com/b"]

What I am currently doing is:

import glob
import json

path = r'/home/spark/' # path to folder containing files
link_list = [] # list of required links
li = "" # contains text of all files combined

all_files = glob.glob(path + "/*")
# Looping through each file and concatenating everything into one string
for filename in all_files:
    with open(filename, "r") as f:
        li = li + f.read()

# Retrieving the link that follows every "loc" key
for k in range(78000000):
    li = li.split('"loc"', 1)[1]  # consume the text up to the next "loc"
    link = li.split('"', 2)[1]    # take the quoted value that follows it
    link_list.append(link)

with open("output.json", "w") as f:
    f.write(json.dumps(link_list))

I guess this is the worst solution anyone could come up with :D, so I need to optimize it to do the job fast and efficiently.

1 Answer

import json
import glob

dict_results = {}
dict_results['links'] = []

# Parse each JSON file and collect every "loc" value
for filename in glob.glob("*.json"):
    with open(filename, "r") as msg:
        data = json.load(msg)
    for url in data['urlset']['url']:
        dict_results['links'].append(url['loc'])

print(dict_results)

If you just want all the links, that should do it. Afterwards, write the result to a file, in text or binary mode, as you wish.

Output:

{'links': ['https://www.example.com/a', 'https://www.example.com/b']}
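
For the final write step, a minimal sketch using json.dump (the output.json filename is an assumption, pick whatever suits you):

import json

# Write the collected links to disk; "output.json" is an assumed name
with open("output.json", "w") as out:
    json.dump(dict_results, out)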

In case you just want a list (and so not a JSON object):

import json
import glob

list_results = []

# Parse each JSON file and collect every "loc" value into a flat list
for filename in glob.glob("*.json"):
    with open(filename, "r") as msg:
        data = json.load(msg)
    for url in data['urlset']['url']:
        list_results.append(url['loc'])

print(list_results)

Output:

['https://www.example.com/a', 'https://www.example.com/b']

If, as it seems, you are working with plain-text JSON files whose layout you know and trust, the fastest way would certainly be this one:

import glob

list_results = []

# Scan each file line by line instead of parsing the JSON
for filename in glob.glob("*.json"):
    with open(filename, "r") as msg:
        for line in msg:
            if '"loc"' in line:
                # On a pretty-printed line like  "loc": "https://...",
                # splitting on '"' leaves the URL at index 3
                list_results.append(line.split('"')[3])

print(list_results)
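
A note on scale: with 78 million links, building the whole list in memory before dumping it may be expensive. A hedged variant of the same line-scanning approach that streams each link straight into a JSON array on disk (the output.json filename is an assumption):

import glob

# Stream links directly into a JSON array on disk instead of
# holding 78 million strings in memory; "output.json" is an assumed name
with open("output.json", "w") as out:
    out.write("[")
    first = True
    for filename in glob.glob("*.json"):
        with open(filename, "r") as msg:
            for line in msg:
                if '"loc"' in line:
                    if not first:
                        out.write(",")
                    out.write('"' + line.split('"')[3] + '"')
                    first = False
    out.write("]")

This assumes the URLs contain no characters that need JSON escaping, which holds for typical sitemap loc values.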

1 Comment

Thank you so much! It worked like a charm. :D
