Python xml: parse attributes

Question

I have a newspaper in XML format and I am trying to parse specific parts of it.

My XML looks like the following:

<?xml version="1.0" encoding="UTF-8"?>
<articles>
   <text>
      <text.cr>
         <pg pgref="1" clipref="1" pos="0,0,2275,3149"/>
         <p type="none">
            <wd pos="0,0,0,0"/>
         </p>
      </text.cr>
      <text.cr>
         <pg pgref="1" clipref="2" pos="0,0,2275,3149"/>
         <p type="none">
            <wd pos="0,0,0,0"/>
         </p>
      </text.cr>
      <text.cr>
         <pg pgref="1" clipref="3" pos="4,32,1078,454"/>
         <p type="none">
            <wd pos="4,32,1078,324">The</wd>
            <wd pos="12,234,1078,450">Newspaper</wd>
         </p>
      </text.cr>

I want to parse "The" and "Newspaper" amongst others. I used xml.etree.ElementTree and my code looks like this:

import xml.etree.ElementTree as ET

for each_file in entries:
    # each_file is the path to one XML file
    mytree = ET.parse(each_file)
    tree = mytree.findall('text')

    for x in tree:
        x_ = x.findall('wd')

I managed to parse the root and also the attributes, but I don't know how to address 'wd'. Thanks for the help.

Jack Fleeting · Accepted Answer · 2020-07-01 13:44:45Z

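findall('wd') only matches direct children of each text element, but the wd elements are nested deeper (inside text.cr and p). Use './/wd' to search all descendants, and change your loop to: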

for x in tree:
    x_ = x.findall('.//wd')
    for t in x_:
        if t.text is not None:
            print(t.text)

Output:

The
Newspaper
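
To see why the original loop found nothing, it can help to compare the two paths side by side. This is a minimal sketch, assuming a trimmed copy of the sample XML from the question; the names sample, text_el, direct, nested and words are illustrative and not from the original post.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample from the question (illustrative only).
sample = """<articles>
   <text>
      <text.cr>
         <p type="none">
            <wd pos="0,0,0,0"/>
            <wd pos="4,32,1078,324">The</wd>
            <wd pos="12,234,1078,450">Newspaper</wd>
         </p>
      </text.cr>
   </text>
</articles>"""

root = ET.fromstring(sample)
text_el = root.find('text')

# 'wd' matches only direct children of <text>; there are none.
direct = text_el.findall('wd')

# './/wd' matches <wd> descendants at any depth below <text>.
nested = text_el.findall('.//wd')

# Empty <wd/> elements have text None, so filter them out.
words = [wd.text for wd in nested if wd.text is not None]

print(len(direct))  # 0
print(words)        # ['The', 'Newspaper']
```

The `is not None` check matters here because the self-closing wd elements carry no text.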

balderman · Accepted Answer · 2020-07-01 13:58:10Z

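The code below parses the XML from a string and prints the pos attribute of every wd element whose text is one of the values you are looking for: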

import xml.etree.ElementTree as ET

xml = '''<?xml version="1.0" encoding="UTF-8"?>
<articles>
   <text>
      <text.cr>
         <pg pgref="1" clipref="1" pos="0,0,2275,3149"/>
         <p type="none">
            <wd pos="0,0,0,0"/>
         </p>
      </text.cr>
      <text.cr>
         <pg pgref="1" clipref="2" pos="0,0,2275,3149"/>
         <p type="none">
            <wd pos="0,0,0,0"/>
         </p>
      </text.cr>
      <text.cr>
         <pg pgref="1" clipref="3" pos="4,32,1078,454"/>
         <p type="none">
            <wd pos="4,32,1078,324">The</wd>
            <wd pos="12,234,1078,450">Newspaper</wd>
         </p>
      </text.cr></text></articles>'''

values = ['The', 'Newspaper']
root = ET.fromstring(xml)
wds = [wd for wd in root.findall('.//wd') if wd.text in values]
for wd in wds:
    print(wd.attrib['pos'])

Output:

4,32,1078,324
12,234,1078,450
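
If the coordinates are needed as numbers rather than as one string, the pos attribute can be split on commas. This is a rough sketch, assuming every pos holds four comma-separated integers (the question does not say what the four values mean); parse_pos is a hypothetical helper, and the sample is a trimmed copy of the XML above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample XML (illustrative only).
sample = """<articles><text><text.cr><p type="none">
<wd pos="0,0,0,0"/>
<wd pos="4,32,1078,324">The</wd>
<wd pos="12,234,1078,450">Newspaper</wd>
</p></text.cr></text></articles>"""


def parse_pos(pos):
    """Hypothetical helper: turn '4,32,1078,324' into (4, 32, 1078, 324)."""
    return tuple(int(part) for part in pos.split(','))


root = ET.fromstring(sample)

# iter('wd') walks every <wd> element in the tree, much like findall('.//wd').
# Elements without text (the empty <wd/> placeholders) are skipped.
words = [
    (wd.text, parse_pos(wd.get('pos')))
    for wd in root.iter('wd')
    if wd.text is not None
]

for text, pos in words:
    print(text, pos)
# The (4, 32, 1078, 324)
# Newspaper (12, 234, 1078, 450)
```

Keeping the text and the position together in one tuple avoids a second lookup by word, which would be ambiguous if the same word appears more than once on a page.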
