I have a problem parsing XML in Python. My input is an XML file like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="maria" audio_filename="agora_2007_11_05_a" version="11" version_date="080826" xml:lang="catalan">
<Topics>
<Topic id="to1" desc="music"/>
<Topic id="to2" desc="bgnoise"/>
<Topic id="to4" desc="silence"/>
<Topic id="to5" desc="speech"/>
<Topic id="to6" desc="speech+music"/>
</Topics>
<Speakers>
<Speaker id="spk1" name="Xavi Coral" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk2" name="Ferran Martínez" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk3" name="Jordi Barbeta" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
</Speakers>
<Section type="report" topic="to6" startTime="111.286" endTime="119.308">
<Turn speaker="spk1" startTime="111.286" endTime="119.308" mode="planned" channel="studio">
<Sync time="111.286"/>
ha estat director del diari La Vanguàrdia,
<Sync time="113.56"/>
ha estat director general de Barcelona Televisió i director del Centre Territorial de Televisió Espanyola a Catalunya,
<Sync time="119.308"/>
actualment col·labora en el diari 
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>
</Section>

And this is my Python code:

import xml.etree.ElementTree as etree
import os
import sys

xmlD = etree.parse(sys.stdin)
root = xmlD.getroot()
sections = root.getchildren()[2].getchildren()
for section in sections:
    turns = section.getchildren()
    for turn in turns:
        speaker = turn.get('speaker')
        mode = turn.get('mode')
        childs = turn.getchildren()
        for child in childs:
            time = child.get('time')
            opt = child.get('desc')
            extent = child.get('extent')

            # Map the language code of an opening Event to its output tag.
            if opt == 'es' and extent == 'begin':
                opt = "ESP:"
            elif opt == "la" and extent == 'begin':
                opt = "LAT:"
            elif opt == "en" and extent == 'begin':
                opt = "ENG:"
            else:
                opt = ""

            # Sync elements carry a time attribute; Event elements do not.
            if not time:
                time = ""

            print time, opt+child.tail.encode('latin-1')

I need to mark every word pronounced in another language with a LANG: prefix, for example: spanish words ENG:hello, spanish words. When two or more consecutive words are pronounced in another language, I don't know how to tag each one of them: spanish words ENG:hello ENG:man, spanish words. The change of language is signalled by the Event XML tag.

Right now the output is: actualment col·labora en el diari ESP:El Periódico de Catalunya. What I want is: actualment col·labora en el diari ESP:El ESP:Periódico de Catalunya.

Could anyone help me?

Thank you!

  • Could you please update what you are expecting as the output with examples. Commented Jun 16, 2015 at 11:54

1 Answer

You can split the tail on spaces and prefix every word with the tag, like this:

print time, opt+(" " + opt).join([c.encode('latin-1').decode('latin-1') for c in child.tail.split(' ')])

Use this in place of your print statement.
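
The same idea can be pulled out into a small helper. This is only a sketch written for Python 3 (the code in this thread is Python 2), and tag_words is a hypothetical name, not something from the question. It uses split() with no argument, so the newlines that surround the text in the XML are dropped and no empty "words" get a prefix.

```python
def tag_words(text, tag):
    """Prefix every whitespace-separated word in text with tag."""
    # split() with no argument strips leading/trailing whitespace and
    # collapses runs of spaces and newlines, unlike split(' ').
    return " ".join(tag + word for word in text.split())


# Tail of the <Event extent="begin"> element from the question.
print(tag_words("\nEl Periódico\n", "ESP:"))  # ESP:El ESP:Periódico
# With an empty tag the words pass through unchanged.
print(tag_words("de Catalunya.", ""))         # de Catalunya.
```

An empty tag leaves the words untouched, so the same call works for text outside a language Event.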

1 Comment

I think the "decode" part is not needed, because I only used the "encode" one and it works :) Thank you!
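
For readers on Python 3, where ElementTree already returns Unicode strings and getchildren() no longer exists (it was removed in 3.9), the encode/decode question goes away. Below is a rough sketch of the whole loop over one Turn, assuming the structure shown in the question. LANG_TAGS, SAMPLE and render_turn are hypothetical names made up for the illustration, and putting all the text that follows a Sync on a single line is a design choice of this sketch, not something the original code does.

```python
import xml.etree.ElementTree as ET

# Language codes taken from the question's code; the dict name is hypothetical.
LANG_TAGS = {"es": "ESP:", "la": "LAT:", "en": "ENG:"}

# Shortened copy of the Turn from the question, used only for the demo.
SAMPLE = """<Turn speaker="spk1">
<Sync time="119.308"/>
actualment col·labora en el diari
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>"""


def render_turn(turn):
    """Return one line per Sync, tagging each word that follows a begin Event."""
    lines = []
    for child in turn:
        tag = ""
        # Only the text right after an opening Event belongs to the other language.
        if child.tag == "Event" and child.get("extent") == "begin":
            tag = LANG_TAGS.get(child.get("desc"), "")
        words = [tag + w for w in (child.tail or "").split()]
        # Start a new output line at every Sync (or at the very first child).
        if child.tag == "Sync" or not lines:
            lines.append([child.get("time", "")])
        lines[-1].extend(words)
    return [" ".join(parts).strip() for parts in lines]


for line in render_turn(ET.fromstring(SAMPLE)):
    print(line)
# 119.308 actualment col·labora en el diari ESP:El ESP:Periódico de Catalunya.
```

To run it on a real file, you would call render_turn on each Turn element found under the parsed root instead of on SAMPLE.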
