I am having trouble parsing an XML file with Python; specifically, I cannot work out the syntax.

My XML files look like this one:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chunkList SYSTEM "ccl.dtd">
<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>dzisiaj</orth>
    <lex disamb="1"><base>dzisiaj</base><ctag>adv:pos</ctag></lex>
   </tok>
   <tok>
    <orth>uczę</orth>
    <lex disamb="1"><base>uczyć</base><ctag>fin:sg:pri:imperf</ctag></lex>
    <prop key="sense:ukb:syns_id">1449</prop>
    <prop key="sense:ukb:syns_rank">1449/0.3151019143 52662/0.2635524432 58124/0.2227816629 58122/0.1985639796</prop>
    <prop key="sense:ukb:unitsstr">szkolić.1(29:cumy) nauczać.1(29:cumy) kształcić.1(29:cumy) edukować.1(29:cumy) uczyć.1(29:cumy)</prop>
   </tok>
   <tok>
    <orth>się</orth>
    <lex disamb="1"><base>się</base><ctag>qub</ctag></lex>
   </tok>
   <tok>
    <orth>o</orth>
    <lex disamb="1"><base>o</base><ctag>prep:acc</ctag></lex>
   </tok>
   <tok>
    <orth>świecie</orth>
    <lex disamb="1"><base>świat</base><ctag>subst:sg:loc:m3</ctag></lex>
    <prop key="sense:ukb:syns_id">7826</prop>
    <prop key="sense:ukb:syns_rank">7826/0.1761356163 43462/0.1512730526 8139/0.1506959982 8361/0.1446884158 3331/0.1435643398 10819/0.1251661757 3332/0.1084764017</prop>
    <prop key="sense:ukb:unitsstr">cywilizacja.1(11:grp) krąg_kulturowy.1(11:grp) kultura.3(11:grp) krąg_cywilizacyjny.1(17:rsl) świat.2(11:grp)</prop>
   </tok>
  </sentence>
 </chunk>
</chunkList>

What I need is a list of tuples, one per <tok> tag, each holding two values: the text of <orth> and the text of <prop key="sense:ukb:syns_id">. For example, for the second <tok> I need the result (uczę, 1449).

I have written pseudocode for how I think it should work, but I have no idea how to implement it using ElementTree (ET).

Here it is:

result_array = []

def tree_search(root):
    for element in root:
        if element == 'tok':
            temp1 = 0
            temp2 = 0
            for token in element:
                if token == 'orth':
                    temp1 = token.value()
                if token == 'prop key="sense:ukb:syns_id"':
                    temp2 = token.value()
            result_array.append((temp1, temp2))

    return result_array
  • So you want to capture only the first <prop> tag within each <tok> tag? Commented Nov 11, 2017 at 16:51
  • I want to capture <orth> and <prop key="sense:ukb:syns_id"> in every <tok> and save them as a tuple. Commented Nov 11, 2017 at 16:53
  • <orth> is always there, but if there is no <prop key=...>, I can just use 0. Commented Nov 11, 2017 at 16:56

1 Answer

Using the xml.etree.ElementTree module:

import xml.etree.ElementTree as ET

root = ET.parse('input.xml').getroot()
result = []
for tok in root.findall('.//tok'):
    result.append((tok.findtext('orth'), tok.findtext('prop[@key="sense:ukb:syns_id"]') or 0))

print(result)

The output:

[('dzisiaj', 0), ('uczę', '1449'), ('się', 0), ('o', 0), ('świecie', '7826')]
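
Note that the output above mixes the string '1449' with the integer 0. If uniform integer values are preferable, a minimal sketch along the following lines might help. The helper name `orth_sense_pairs` and the shortened inline sample are illustrative assumptions, not part of the original answer, and the sketch assumes every syns_id value is a plain integer, as in the question's data.

```python
import xml.etree.ElementTree as ET

# Shortened inline sample (assumption: same structure as the question's file).
SAMPLE = """<chunkList>
 <chunk id="ch1" type="p">
  <sentence id="s1">
   <tok>
    <orth>dzisiaj</orth>
   </tok>
   <tok>
    <orth>uczę</orth>
    <prop key="sense:ukb:syns_id">1449</prop>
   </tok>
  </sentence>
 </chunk>
</chunkList>"""


def orth_sense_pairs(root):
    """Return (orth, syns_id) tuples, with syns_id as int and 0 when absent."""
    pairs = []
    for tok in root.iter('tok'):
        orth = tok.findtext('orth')
        sense = tok.findtext('prop[@key="sense:ukb:syns_id"]')
        # findtext returns None when no matching <prop> exists
        pairs.append((orth, int(sense) if sense else 0))
    return pairs


pairs = orth_sense_pairs(ET.fromstring(SAMPLE))
print(pairs)  # [('dzisiaj', 0), ('uczę', 1449)]
```

For a file on disk, the same function can be called with `ET.parse('input.xml').getroot()` as in the answer's code.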

Details:

  • for tok in root.findall('.//tok'): - iterates over all <tok> tags at any depth below the root
  • tok.findtext('orth') - gets the text content of the <orth> tag within the <tok> tag currently being processed
  • tok.findtext('prop[@key="sense:ukb:syns_id"]') or 0 - gets the text content of the <prop> tag with the specified key attribute. If no such tag exists, findtext returns None, so the value falls back to 0
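
To see the individual pieces from the details above in isolation, here is a small sketch that runs them on two hand-written <tok> fragments. The fragments and variable names are made up for illustration; only `findtext` and the attribute predicate come from the answer.

```python
import xml.etree.ElementTree as ET

# Two illustrative <tok> fragments: one with the syns_id <prop>, one without.
tok_with = ET.fromstring(
    '<tok><orth>uczę</orth>'
    '<prop key="sense:ukb:syns_id">1449</prop>'
    '<prop key="sense:ukb:syns_rank">1449/0.31</prop></tok>'
)
tok_without = ET.fromstring('<tok><orth>się</orth></tok>')

# The predicate [@key="..."] selects only the <prop> whose key attribute matches.
PATH = 'prop[@key="sense:ukb:syns_id"]'

found = tok_with.findtext(PATH)        # '1449'
missing = tok_without.findtext(PATH)   # None, because nothing matches
fallback = missing or 0                # None is falsy, so this yields 0

print(found, missing, fallback)
```

The `or 0` fallback also covers a matching <prop> that is empty, since an empty string is falsy as well.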

2 Comments

I HAVE BEEN TRYING FOR HOURS TO MAKE THIS WORK, AND THIS DOES. God bless you sir. Would you like to explain it tho?
@AwangardowyKaloryfer, you're welcome. See my details
