I have about 5k JSON files, all structured similarly. I need to collect every value of the "loc" key from all the files and store them in a separate JSON file or two. The "loc" values across all the files add up to about 78 million, so I am looking for the most optimized and fastest way to do this.
The structure of the content in every file looks like this:
{
    "urlset": {
        "@xmlns": "http://www.sitemaps.org/schemas/sitemap/0.9",
        "@xmlns:xhtml": "http://www.w3.org/1999/xhtml",
        "url": [
            {
                "loc": "https://www.example.com/a",
                "xhtml:link": {
                    "@rel": "alternate",
                    "@href": "android-app://com.example/xyz"
                },
                "lastmod": "2020-12-25",
                "priority": "0.8"
            },
            {
                "loc": "https://www.example.com/b",
                "xhtml:link": {
                    "@rel": "alternate",
                    "@href": "android-app://com.example/xyz"
                },
                "lastmod": "2020-12-25",
                "priority": "0.8"
            }
        ]
    }
}
I am looking for an output JSON file like:
["https://www.example.com/a","https://www.example.com/b"]
What I am currently doing is:
import glob
import json

path = r'/home/spark/'  # path to folder containing the files
link_list = []          # list of required links
li = ""                 # contains text of all files combined

all_files = glob.glob(path + "/*")

# Loop through each file and append its raw text to one big string
for i in range(0, len(all_files)):
    filename = all_files[i]
    with open(filename, "r") as f:
        li = li + f.read()

# Retrieve the link after every "loc" key by repeatedly splitting the string
for k in range(0, 7800000):
    lk = ((li.split('"loc"', 1)[1]).split('"', 1)[1]).split(" ", 1)[0]
    link = lk.replace('",', '')
    link_list.append(link)
    li = li.split('"loc"', 1)[1]  # move past the "loc" that was just handled

with open("output.json", "w") as f:
    f.write(json.dumps(link_list))
I guess this is about the worst solution anyone could come up with :D, so I need to optimize it to do the job fast and efficiently.
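The direction I am considering instead is to parse each file on its own with json.load and extend a single list, rather than splitting one giant combined string (a rough sketch, assuming every file parses cleanly and matches the structure above; I have not benchmarked it, which is part of what I am asking):

import glob
import json

path = r'/home/spark/'
link_list = []

# rough sketch: parse each file individually instead of string-splitting
for filename in glob.glob(path + "/*"):
    with open(filename, "r") as f:
        data = json.load(f)
    # collect every "loc" value from this file
    link_list.extend(entry["loc"] for entry in data["urlset"]["url"])

with open("output.json", "w") as f:
    json.dump(link_list, f)

But I am not sure whether holding 78 million strings in one list is reasonable, whether the output should be split across two files as mentioned above, or whether the work should be parallelized across the 5k files.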