<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
        </back>

I have an XML file of scientific journal metadata and am trying to extract just the funding information for each article. I need the text contained in the p tag. While the "sec id" varies between articles, the "sec-type" is always "funding".

I have been trying to do this in Python 3 using ElementTree.

import xml.etree.ElementTree as ET  

tree = ET.parse("journals.xml")
root = tree.getroot()
for title in root.iter("title"):
    ET.dump(title)

Any help would be greatly appreciated!

  • Can you give an example of full valid XML? Commented Jan 15, 2019 at 14:58

1 Answer


You can use findall with an XPath expression to extract the values you want. I extrapolated from your example data a little bit in order to complete the document and have two p elements:

<root>
  <article>
    <front>
      <body>
        <back>
          <sec id="sec7" sec-type="funding">
            <title>Funding</title>
            <p>This work was supported by the NIH</p>
          </sec>
          <sec id="sec8" sec-type="funding">
            <title>Funding</title>
            <p>I'm a little teapot</p>
          </sec>
        </back>
      </body>
    </front>
  </article>
</root>

The following extracts the text content of every p node directly under a sec node where sec-type="funding":

import xml.etree.ElementTree as ET

doc = ET.parse('journals.xml')
print([p.text for p in doc.findall('.//sec[@sec-type="funding"]/p')])

Result:

['This work was supported by the NIH', "I'm a little teapot"]
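
One caveat worth checking against your real data: p.text only returns the text before the first child element, so a funding paragraph that contains inline markup would come back truncated. Below is a minimal sketch of a variant that joins all nested text with itertext(). The sample fragment, the nested bold element, and the helper name funding_texts are assumptions made up for illustration, not taken from your file.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the nested <bold> element is an assumption,
# used only to show how inline markup affects p.text.
SAMPLE = """\
<back>
  <sec id="sec7" sec-type="funding">
    <title>Funding</title>
    <p>This work was supported by the <bold>NIH</bold> under an example grant.</p>
  </sec>
</back>
"""


def funding_texts(root):
    """Return the full text of each funding <p>, including nested elements."""
    paragraphs = root.findall('.//sec[@sec-type="funding"]/p')
    # itertext() walks the paragraph and every descendant in document order
    return ["".join(p.itertext()).strip() for p in paragraphs]


root = ET.fromstring(SAMPLE)
first_p = root.find('.//sec[@sec-type="funding"]/p')

print(repr(first_p.text))   # only the text before <bold>
print(funding_texts(root))  # the whole sentence
```

If your funding paragraphs never contain child elements, the simpler p.text version above is enough.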

1 Comment

Thanks for your answer. Is there a way to combine this XPath expression with a simple search for a specific element's text, so that for each article I get the title along with the corresponding funding info?

for elem in tree.iter(tag='article-id'):
    print(elem.text)
print([p.text for p in doc.findall('.//sec[@sec-type="funding"]/p')])

This gives me the article IDs and the funding info separately, but ideally I want them matched up.
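
One possible way to keep the ID and the funding text matched is to loop over the article elements and run both lookups relative to each article, rather than searching the whole document twice. This is only a sketch: the sample document, the position of article-id inside front, and the helper name funding_by_article are assumptions for illustration, so adjust the paths to your actual file.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: where <article-id> sits inside each <article>
# is an assumption; the real file may nest it differently.
SAMPLE = """\
<root>
  <article>
    <front>
      <article-id>A1</article-id>
    </front>
    <back>
      <sec id="sec7" sec-type="funding">
        <title>Funding</title>
        <p>This work was supported by the NIH</p>
      </sec>
    </back>
  </article>
  <article>
    <front>
      <article-id>A2</article-id>
    </front>
    <back>
      <sec id="sec3" sec-type="funding">
        <title>Funding</title>
        <p>I'm a little teapot</p>
      </sec>
    </back>
  </article>
</root>
"""


def funding_by_article(root):
    """Return a list of (article_id, [funding paragraph texts]) pairs."""
    results = []
    for article in root.iter("article"):
        # Both searches start at this article, so the results stay paired.
        # findtext returns None if the article has no <article-id>.
        article_id = article.findtext(".//article-id")
        funding = [p.text for p in article.findall('.//sec[@sec-type="funding"]/p')]
        results.append((article_id, funding))
    return results


root = ET.fromstring(SAMPLE)
for article_id, funding in funding_by_article(root):
    print(article_id, funding)
```

With a file on disk you would replace ET.fromstring(SAMPLE) with ET.parse('journals.xml').getroot(). An article without a funding section simply yields an empty list.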
