I'm using lxml's XPath support to parse the following XML file:
<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>
      https://www.reuters.com/article/us-campbellsoup-thirdpoint/campbell-soup-nears-deal-with-third-point-to-end-board-challenge-sources-idUSKCN1NU11I
    </loc>
    <image:image>
      <image:loc>
        https://www.reuters.com/resources/r/?m=02&d=20181126&t=2&i=1328589868&w=&fh=&fw=&ll=460&pl=300&r=LYNXNPEEAO0WM
      </image:loc>
    </image:image>
    <news:news>
      <news:publication>
        <news:name>Reuters</news:name>
        <news:language>eng</news:language>
      </news:publication>
      <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
      <news:title>
        Campbell Soup nears deal with Third Point to end board challenge: sources
      </news:title>
      <news:keywords>Headlines,Business, Industry</news:keywords>
      <news:stock_tickers>NYSE:CPB</news:stock_tickers>
    </news:news>
  </url>
</urlset>
Python code sample:
import lxml.etree
import requests


def main():
    r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
    root = lxml.etree.fromstring(r.content)

    # Prints every title in the document
    records = root.xpath('//news:title', namespaces={"news": "http://www.google.com/schemas/sitemap-news/0.9"})
    for record in records:
        print(record.text)

    # Prints every sitemap <loc>, with no link back to the titles above
    records = root.xpath('//sitemap:loc', namespaces={"sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"})
    for record in records:
        print(record.text)


if __name__ == "__main__":
    main()
Currently, I'm using XPath to get all the URLs and all the titles, but this is not what I want because I can't tell which URL belongs to which title. My question is: how do I select each <url> element and then loop over them, treating each <url> as an item, to get its corresponding <loc>, <news:keywords>, etc.? Thanks!
Edit: Expected output
foreach <url>
    get <loc>
    get <news:publication_date>
    get <news:title>
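
The loop described above could be sketched roughly as follows. This is only a minimal illustration: it assumes that calling `xpath()` on each `<url>` element with a relative expression (no leading `//`) keeps the lookup inside that element. The `parse_sitemap` helper, the `sm` prefix and the inline `SAMPLE` data are made up for the example and are not part of the original script.

```python
import lxml.etree

# Prefixes are arbitrary labels chosen for this sketch; only the URIs must match the document.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
}

# Illustrative stand-in for the downloaded sitemap (not real Reuters data).
SAMPLE = b"""<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>
      https://example.com/a
    </loc>
    <news:news>
      <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
      <news:title>
        First title
      </news:title>
    </news:news>
  </url>
  <url>
    <loc>https://example.com/b</loc>
    <news:news>
      <news:publication_date>2018-11-27T00:00:00+00:00</news:publication_date>
      <news:title>Second title</news:title>
    </news:news>
  </url>
</urlset>"""


def parse_sitemap(xml_bytes):
    """Return one dict per <url>, so loc, date and title stay together."""
    root = lxml.etree.fromstring(xml_bytes)
    items = []
    for url in root.xpath("//sm:url", namespaces=NS):
        # Relative paths are evaluated from `url`, not from the document root.
        # normalize-space() trims surrounding whitespace and yields "" if the node is missing.
        items.append({
            "loc": url.xpath("normalize-space(sm:loc)", namespaces=NS),
            "publication_date": url.xpath(
                "normalize-space(news:news/news:publication_date)", namespaces=NS),
            "title": url.xpath("normalize-space(news:news/news:title)", namespaces=NS),
        })
    return items


for item in parse_sitemap(SAMPLE):
    print(item["loc"], item["publication_date"], item["title"])
```

In the original script, `parse_sitemap(r.content)` would presumably take the place of the two separate `root.xpath` calls. Other children such as `<news:keywords>` could be read the same way with one more relative expression.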