Using lxml xpath to parse xml file

Question

I'm using lxml XPath to parse the following XML file:

<urlset
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
    <url>
        <loc>
    https://www.reuters.com/article/us-campbellsoup-thirdpoint/campbell-soup-nears-deal-with-third-point-to-end-board-challenge-sources-idUSKCN1NU11I
    </loc>
        <image:image>
            <image:loc>
    https://www.reuters.com/resources/r/?m=02&d=20181126&t=2&i=1328589868&w=&fh=&fw=&ll=460&pl=300&r=LYNXNPEEAO0WM
    </image:loc>
        </image:image>
        <news:news>
            <news:publication>
                <news:name>Reuters</news:name>
                <news:language>eng</news:language>
            </news:publication>
            <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
            <news:title>
    Campbell Soup nears deal with Third Point to end board challenge: sources
    </news:title>
            <news:keywords>Headlines,Business, Industry</news:keywords>
            <news:stock_tickers>NYSE:CPB</news:stock_tickers>
        </news:news>
    </url>
</urlset>

Python code sample

import lxml.etree
import requests

def main():
    r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
    root = lxml.etree.fromstring(r.content)

    # All titles in the document, in document order
    records = root.xpath('//news:title', namespaces={"news": "http://www.google.com/schemas/sitemap-news/0.9"})
    for record in records:
        print(record.text)

    # All URLs in the document, fetched separately from the titles
    records = root.xpath('//sitemap:loc', namespaces={"sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9"})
    for record in records:
        print(record.text)


if __name__ == "__main__":
    main()

Currently, I'm using XPath to get all the URLs and all the titles, but this is not what I want because I can't tell which URL belongs to which title. How can I get each <url> element and then loop over them, treating each <url> as an item from which I read the corresponding <loc>, <news:keywords>, etc.? Thanks!

Edit: Expected output

foreach <url>
      get <loc>
      get <news:publication_date>
      get <news:title>

Can you post an example of your expected output?

– BernardL, Commented Nov 27, 2018 at 3:50
@BernardL Expected output added.

– Tester, Commented Nov 27, 2018 at 4:02

Tomalak · Accepted Answer · 2018-11-27 04:23:19Z

2

Use relative XPath to get from each title to its associated URL:

import lxml.etree
import requests

# Prefix-to-URI map; the prefixes are local to this script
ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = lxml.etree.fromstring(r.content)

for title in root.xpath('//news:title', namespaces=ns):
    print(title.text)

    # Walk up from the title to its enclosing <url>, then down to <loc>
    loc = title.xpath('ancestor::sitemap:url/sitemap:loc', namespaces=ns)
    print(loc[0].text)

Exercise: Rewrite this to get from the URL to the associated title instead.
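One possible way to approach the exercise, shown as a minimal sketch rather than the only solution: start from each <url> element and use relative XPath expressions to reach its <loc> and <news:title>. The inline SAMPLE document, the example.com URLs and the pairs list are made up for illustration so the snippet runs without a network request.

```python
from lxml import etree

# Made-up stand-in for the real sitemap, so no HTTP request is needed.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>
      https://example.com/a
    </loc>
    <news:news><news:title>Title A</news:title></news:news>
  </url>
  <url>
    <loc>
      https://example.com/b
    </loc>
    <news:news><news:title>Title B</news:title></news:news>
  </url>
</urlset>
"""

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
}

root = etree.fromstring(SAMPLE)

pairs = []
for url in root.xpath("//sitemap:url", namespaces=ns):
    # Paths without a leading slash are evaluated relative to this <url>.
    # string(...) returns the text of the first match, or "" if none.
    loc = url.xpath("string(sitemap:loc)", namespaces=ns).strip()
    title = url.xpath("string(news:news/news:title)", namespaces=ns).strip()
    pairs.append((loc, title))

for loc, title in pairs:
    print(loc, "->", title)
```

The strip() calls are there because the sitemap in the question wraps the <loc> and <news:title> text in newlines and indentation.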

Note: The titles (and potentially the URLs as well) seem to be HTML-escaped. Use the unescape() function

from html import unescape

to unescape them.
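As a small illustration of what unescape() does, here is a sketch using a made-up title string (not taken from the Reuters feed):

```python
from html import unescape

# Hypothetical title as it might come out of the parsed XML,
# still carrying HTML character references.
raw_title = "AT&amp;T shares rise after Q&amp;A with investors"

clean_title = unescape(raw_title)
print(clean_title)

# Text without character references passes through unchanged.
plain = unescape("Campbell Soup nears deal")
```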

edited Nov 27, 2018 at 4:23

answered Nov 27, 2018 at 4:17

Tomalak


Eric Chow · Accepted Answer · 2021-01-11 03:16:32Z

The answer is

from datetime import datetime
from html import unescape
from lxml import etree
import requests

r = requests.get("https://www.reuters.com/sitemap_news_index1.xml")
root = etree.fromstring(r.content)

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "image": "http://www.google.com/schemas/sitemap-image/1.1"
}

for url in root.iterfind("sitemap:url", namespaces=ns):
    loc = url.findtext("sitemap:loc", namespaces=ns)
    print(loc)
    title = unescape(url.findtext("news:news/news:title", namespaces=ns))
    print(title)
    date = unescape(url.findtext("news:news/news:publication_date", namespaces=ns))
    date = datetime.strptime(date, '%Y-%m-%dT%H:%M:%S+00:00')
    print(date)
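The strptime format above hard-codes the +00:00 offset, so it would raise ValueError if the feed ever used a different offset. A possible alternative, sketched here under the assumption of Python 3.7 or later, is datetime.fromisoformat, which parses the offset and returns a timezone-aware datetime. The helper name parse_publication_date is hypothetical.

```python
from datetime import datetime, timezone


def parse_publication_date(text):
    """Parse an ISO 8601 timestamp such as 2018-11-26T02:55:00+00:00.

    Hypothetical helper: strips surrounding whitespace, then lets
    datetime.fromisoformat handle the UTC offset.
    """
    return datetime.fromisoformat(text.strip())


utc_date = parse_publication_date("2018-11-26T02:55:00+00:00")
# A non-zero offset parses as well and can be normalized to UTC.
other_date = parse_publication_date("2018-11-26T07:55:00+05:00")

print(utc_date)
print(other_date.astimezone(timezone.utc))
```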

The rules of thumb are:

Try not to use xpath. Prefer find, findall, or iterfind: xpath is a more complex algorithm than those, so it takes more time and resources.

Use iterfind instead of findall. iterfind yields the items one at a time instead of building a complete list first, so it uses less memory.

Use findtext if all you need is the text.

A more general rule is to read the official documentation.
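The rules above can be sketched on a small inline document. SAMPLE, the example.com URL and the iter_records helper are made up for illustration. The fallback import is an assumption for machines without lxml: the standard library's ElementTree offers the same iterfind and findtext methods, though not xpath.

```python
from html import unescape

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library, which has the
    # same find-style API used below.
    import xml.etree.ElementTree as etree

# Made-up sample; the title is HTML-escaped inside the XML on purpose.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>
      https://example.com/a
    </loc>
    <news:news>
      <news:publication_date>2018-11-26T02:55:00+00:00</news:publication_date>
      <news:title>Soup &amp;amp; sandwiches</news:title>
    </news:news>
  </url>
</urlset>
"""

ns = {
    "news": "http://www.google.com/schemas/sitemap-news/0.9",
    "sitemap": "http://www.sitemaps.org/schemas/sitemap/0.9",
}


def iter_records(root):
    """Yield one dict per <url>, so each loc stays paired with its title."""
    for url in root.iterfind("sitemap:url", namespaces=ns):
        # default="" avoids None when an element is missing.
        loc = url.findtext("sitemap:loc", default="", namespaces=ns)
        title = url.findtext("news:news/news:title", default="", namespaces=ns)
        date = url.findtext(
            "news:news/news:publication_date", default="", namespaces=ns
        )
        yield {
            "loc": loc.strip(),
            "title": unescape(title).strip(),
            "date": date.strip(),
        }


records = list(iter_records(etree.fromstring(SAMPLE)))
for record in records:
    print(record)
```

Passing default="" to findtext means a missing element produces an empty string instead of None, so the strip() and unescape() calls do not fail on incomplete entries.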

First, let's create three for-loop functions and compare them.

def for1():
    for url in root.iterfind("sitemap:url", namespaces=ns):
        pass

def for2():
    for url in root.findall("sitemap:url", namespaces=ns):
        pass

def for3():
    for url in root.xpath("sitemap:url", namespaces=ns):
        pass

function	time
`root.iterfind`	70.5 µs ± 543 ns
`root.findall`	72.3 µs ± 839 ns
`root.xpath`	84.8 µs ± 567 ns

We can see that iterfind is the fastest, as expected, although its lead over findall is small (about 2.5%). xpath is roughly 20% slower than iterfind.
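For readers who want to repeat this kind of comparison, here is a rough sketch using the standard timeit module. It is not the setup that produced the numbers above: the synthetic 100-element document, the repeat counts and the timings dict are assumptions, and absolute results will differ by machine and lxml version.

```python
import timeit

from lxml import etree

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
ns = {"sitemap": SITEMAP_NS}

# Synthetic document with 100 <url> elements (not the Reuters feed).
xml = '<urlset xmlns="%s">%s</urlset>' % (
    SITEMAP_NS,
    "<url><loc>https://example.com/</loc></url>" * 100,
)
root = etree.fromstring(xml)


def with_iterfind():
    return sum(1 for _ in root.iterfind("sitemap:url", namespaces=ns))


def with_findall():
    return sum(1 for _ in root.findall("sitemap:url", namespaces=ns))


def with_xpath():
    return sum(1 for _ in root.xpath("sitemap:url", namespaces=ns))


candidates = {
    "iterfind": with_iterfind,
    "findall": with_findall,
    "xpath": with_xpath,
}

NUMBER = 200  # calls per measurement
timings = {}
for name, func in candidates.items():
    # Best of three runs, converted to seconds per call.
    best = min(timeit.repeat(func, number=NUMBER, repeat=3))
    timings[name] = best / NUMBER

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per loop")
```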

Next, let's check the statements inside the for loop.

statement	time
`url.xpath('string(news:news/news:title)', namespaces=ns)`	15.7 µs ± 112 ns
`url.xpath('news:news/news:title', namespaces=ns)[0].text`	14.4 µs ± 53.7 ns
`url.find('news:news/news:title', namespaces=ns).text`	3.74 µs ± 60 ns
`url.findtext('news:news/news:title', namespaces=ns)`	3.71 µs ± 40.3 ns

From the above table, we can see that find and findtext are roughly 4 times faster than xpath. findtext is marginally faster than find, although that difference is within the measurement error.

This answer takes only 3.41 ms ± 53 µs, compared to 8.33 ms ± 52.4 µs for Tomalak's.
