I have an XML file with the following structure:

<Thread THREAD_SEQUENCE="Q268_R16">
<RelQuestion RELQ_ID="Q268_R16">
<RelQSubject>Best Bank.</RelQSubject>
<RelQBody>Hi ti all QL's; What bank you are using? and why? Are you using this bank just because it has an affiliate at home? Regards;</RelQBody>
</RelQuestion>
</Thread>

The XML file contains 244 RelQBody tags. I want to get the text inside each RelQBody tag. I have tried something like this:

import xml.dom.minidom
dom = xml.dom.minidom.parse("test.xml")
data = dom.documentElement

question = data.getElementsByTagName("RelQBody")
i=1
for q in question:
    print("%i. %s" % (i, q.childNodes[0].data))
    i = i+1

But I keep getting this error:

Traceback (most recent call last):
  File "C:\Users\Administrator\Documents\python\test.py", line 13, in <module>
    print("%i. %s" % (i, q.childNodes[0].data))
IndexError: list index out of range

However, when I tried this code:

import xml.dom.minidom
dom = xml.dom.minidom.parse("test.xml")
data = dom.documentElement

question = data.getElementsByTagName("RelQBody")
i=1
for q in question:
    print("%i" % i)
    i = i+1

I got the numbers 1-244, which exactly matches the number of RelQBody tags in the dataset.

So why is there a difference between printing with the text and printing without it? Can someone tell me what I did wrong? I'm new to Python, so any help will be appreciated. Thanks.

2 Answers

import xml.dom.minidom
dom = xml.dom.minidom.parse("test.xml")
data = dom.documentElement

question = data.getElementsByTagName("RelQBody")
for i,q in enumerate(question):
    if len(q.childNodes) > 0:
        print("%i. %s" % (i+1, q.childNodes[0].data))

1 Comment

A further question: when a RelQBody tag is empty, I want to use the text inside RelQSubject as the question instead. I wrote this code:

for i in range(len(qbody)):
    if len(qbody[i].childNodes) > 0:
        question.append(qbody[i].childNodes[0].data.lower())
    else:
        question.append(qsubject[i].childNodes[0].data.lower())

Is there a better way to achieve this?
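
One possible approach to the fallback asked about in the comment above is to walk the RelQuestion elements and read both children from the same parent, rather than pairing two separate lists by index. This is only a sketch: the helper names `text_of`, `first_text`, and `collect_questions` and the two-thread `SAMPLE` string are hypothetical and not part of the original file, and it assumes each RelQuestion holds at most one RelQBody and one RelQSubject.

```python
import xml.dom.minidom

# Hypothetical sample data: the second thread has an empty RelQBody.
SAMPLE = """<Threads>
<Thread THREAD_SEQUENCE="Q1_R1">
<RelQuestion RELQ_ID="Q1_R1">
<RelQSubject>Best Bank.</RelQSubject>
<RelQBody>What bank are you using?</RelQBody>
</RelQuestion>
</Thread>
<Thread THREAD_SEQUENCE="Q2_R1">
<RelQuestion RELQ_ID="Q2_R1">
<RelQSubject>Visa Question</RelQSubject>
<RelQBody></RelQBody>
</RelQuestion>
</Thread>
</Threads>"""


def text_of(element):
    """Join the text nodes directly under element; '' if there are none."""
    return "".join(
        node.data
        for node in element.childNodes
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE)
    ).strip()


def first_text(parent, tag):
    """Text of the first <tag> under parent, or '' if the tag is missing."""
    matches = parent.getElementsByTagName(tag)
    return text_of(matches[0]) if matches else ""


def collect_questions(dom):
    """Use the RelQBody text, falling back to RelQSubject when it is empty."""
    questions = []
    for rel_q in dom.getElementsByTagName("RelQuestion"):
        body = first_text(rel_q, "RelQBody")
        questions.append((body or first_text(rel_q, "RelQSubject")).lower())
    return questions


questions = collect_questions(xml.dom.minidom.parseString(SAMPLE))
print(questions)  # ['what bank are you using?', 'visa question']
```

Reading both tags from the same RelQuestion keeps body and subject aligned even if some thread lacks one of them, which an index-based pairing of two lists would not guarantee.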

I'm guessing the culprit is childNodes[0]: if one of the nodes has no children, its childNodes list is empty, and indexing childNodes[0] raises an IndexError.

So try this:

import xml.dom.minidom
dom = xml.dom.minidom.parse("test.xml")
data = dom.documentElement

question = data.getElementsByTagName("RelQBody")
i=1
for q in question:
    if len(q.childNodes) > 0:
        print("%i. %s" % (i, q.childNodes[0].data))
    i = i+1
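
To see the behaviour in isolation, here is a small reproduction. It is a sketch that assumes the cause is an empty RelQBody element; the inline `SAMPLE` string is made up for illustration and stands in for test.xml.

```python
import xml.dom.minidom

# Hypothetical sample: the second RelQBody has no text, so no child nodes.
SAMPLE = """<Threads>
<Thread THREAD_SEQUENCE="Q1_R1">
<RelQuestion RELQ_ID="Q1_R1">
<RelQSubject>Best Bank.</RelQSubject>
<RelQBody>What bank are you using?</RelQBody>
</RelQuestion>
</Thread>
<Thread THREAD_SEQUENCE="Q2_R1">
<RelQuestion RELQ_ID="Q2_R1">
<RelQSubject>Visa Question</RelQSubject>
<RelQBody></RelQBody>
</RelQuestion>
</Thread>
</Threads>"""

dom = xml.dom.minidom.parseString(SAMPLE)
bodies = dom.documentElement.getElementsByTagName("RelQBody")

# Number of child nodes per RelQBody: the empty element has zero.
child_counts = [len(body.childNodes) for body in bodies]
print(child_counts)  # [1, 0]

for i, body in enumerate(bodies, start=1):
    try:
        print("%i. %s" % (i, body.childNodes[0].data))
    except IndexError:
        # Same error as in the question, caught here instead of crashing.
        print("%i. (empty RelQBody, no child nodes)" % i)
```

Counting the elements works in both versions of the question's code because getElementsByTagName returns empty elements too; only the access to childNodes[0] fails.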

1 Comment

I just looked further down the XML and yes, there are 2 threads with an empty RelQBody. I guess I have more preprocessing work to do on this file, lol.
