Parsing XML using cElementTree in Python

Question

I have a problem parsing an XML file in Python, specifically with the syntax.

My XML files look like this one:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>dzisiaj</orth>
    <lex disamb="1"><base>dzisiaj</base><ctag>adv:pos</ctag></lex>
   </tok>
   <tok>
    <orth>uczę</orth>
    <lex disamb="1"><base>uczyć</base><ctag>fin:sg:pri:imperf</ctag></lex>
    <prop key="sense:ukb:syns_id">1449</prop>
    <prop key="sense:ukb:syns_rank">1449/0.3151019143 52662/0.2635524432 58124/0.2227816629 58122/0.1985639796</prop>
    <prop key="sense:ukb:unitsstr">szkolić.1(29:cumy) nauczać.1(29:cumy) kształcić.1(29:cumy) edukować.1(29:cumy) uczyć.1(29:cumy)</prop>
   </tok>
   <tok>
    <orth>się</orth>
    <lex disamb="1"><base>się</base><ctag>qub</ctag></lex>
   </tok>
   <tok>
    <orth>o</orth>
    <lex disamb="1"><base>o</base><ctag>prep:acc</ctag></lex>
   </tok>
   <tok>
    <orth>świecie</orth>
    <lex disamb="1"><base>świat</base><ctag>subst:sg:loc:m3</ctag></lex>
    <prop key="sense:ukb:syns_id">7826</prop>
    <prop key="sense:ukb:syns_rank">7826/0.1761356163 43462/0.1512730526 8139/0.1506959982 8361/0.1446884158 3331/0.1435643398 10819/0.1251661757 3332/0.1084764017</prop>
    <prop key="sense:ukb:unitsstr">cywilizacja.1(11:grp) krąg_kulturowy.1(11:grp) kultura.3(11:grp) krąg_cywilizacyjny.1(17:rsl) świat.2(11:grp)</prop>
   </tok>
  </sentence>
 </chunk>
</chunkList>

What I need is a list of tuples with two values each: the text of <orth> and the text of <prop key="sense:ukb:syns_id">, with one tuple per <tok> tag. For example, for the second <tok> I need a result like (uczę, 1449).

I have written pseudocode for how I think it should work, but I have no idea how to implement it using ET.

Here it is:

resultArray = []

def treeSearch(root):
    for element in root:
        if element == 'tok':
            temp1 = 0
            temp2 = 0
            for token in element:
                if token == 'orth':
                    temp1 = token.value()
                if token == 'prop key="sense:ukb:syns_id"':
                    temp2 = token.value()
            tempTuple = (temp1, temp2)
            resultArray.append(tempTuple)

    return resultArray

So you want to capture only the first <prop> tag within each <tok> tag?
– RomanPerekhrest, Commented Nov 11, 2017 at 16:51
I want to capture <orth> and <prop key="sense:ukb:syns_id"> in every <tok> and save them as a tuple.
– AwangardowyKaloryfer, Commented Nov 11, 2017 at 16:53
<orth> is always there, but if there is no <prop key=...>, I can just use 0.
– AwangardowyKaloryfer, Commented Nov 11, 2017 at 16:56

RomanPerekhrest · Accepted Answer · 2017-11-11 19:59:24Z

With the xml.etree.ElementTree module:

import xml.etree.ElementTree as ET

root = ET.parse('input.xml').getroot()
result = []
for tok in root.findall('.//tok'):
    result.append((tok.findtext('orth'), tok.findtext('prop[@key="sense:ukb:syns_id"]') or 0))

print(result)

The output:

[('dzisiaj', 0), ('uczę', '1449'), ('się', 0), ('o', 0), ('świecie', '7826')]

Details:

for tok in root.findall('.//tok'): - iterates over all <tok> tags at any depth below the root
tok.findtext('orth') - returns the text content of the <orth> tag within the <tok> tag currently being processed
tok.findtext('prop[@key="sense:ukb:syns_id"]') or 0 - returns the text content of the <prop> tag whose key attribute has the specified value; if no such tag exists, findtext returns None and the value falls back to 0
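To try the same idea without an input file, here is a minimal self-contained sketch. It is a variation on the code above, not a replacement for it: the SAMPLE string is a shortened copy of the XML from the question, the function name extract_pairs is made up for illustration, and converting the ID to int is an assumption about what you want (the original code keeps found IDs as strings). Note also that xml.etree.cElementTree was removed in Python 3.9, so on current versions importing xml.etree.ElementTree is the way to go.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML (declaration and DOCTYPE omitted).
SAMPLE = """\
<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>dzisiaj</orth>
    <lex disamb="1"><base>dzisiaj</base><ctag>adv:pos</ctag></lex>
   </tok>
   <tok>
    <orth>uczę</orth>
    <lex disamb="1"><base>uczyć</base><ctag>fin:sg:pri:imperf</ctag></lex>
    <prop key="sense:ukb:syns_id">1449</prop>
    <prop key="sense:ukb:syns_rank">1449/0.3151019143</prop>
   </tok>
  </sentence>
 </chunk>
</chunkList>
"""

# Path to the <prop> child carrying the synset ID.
SYNS_ID_PATH = 'prop[@key="sense:ukb:syns_id"]'


def extract_pairs(xml_text):
    """Return one (orth, syns_id) tuple per <tok>; syns_id is 0 if absent."""
    root = ET.fromstring(xml_text)
    pairs = []
    for tok in root.iter('tok'):  # every <tok> at any depth
        orth = tok.findtext('orth')
        raw_id = tok.findtext(SYNS_ID_PATH)  # None when there is no such <prop>
        # Assumption: IDs are plain integers; missing or empty falls back to 0.
        syns_id = int(raw_id) if raw_id else 0
        pairs.append((orth, syns_id))
    return pairs


pairs = extract_pairs(SAMPLE)
print(pairs)
```

With this variant every second element is an int, so the tuples have a uniform type, which may be easier to work with than a mix of 0 and strings like '1449'.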

edited Nov 11, 2017 at 19:59

answered Nov 11, 2017 at 18:26


2 Comments

AwangardowyKaloryfer Over a year ago

I HAVE BEEN TRYING FOR HOURS TO MAKE THIS WORK, AND THIS DOES. God bless you sir. Would you like to explain it tho?

RomanPerekhrest Over a year ago

@AwangardowyKaloryfer, you're welcome. See my details
