
I have a .tmx file and want to extract the text from the seg tag. Because of inline tags such as bpt and ept, I cannot get at this text, so I would like to remove the bpt tag completely. I tried the .remove() method, but it also removes the text that follows the tag.

I cannot use BeautifulSoup because my original file is .tmx.

  • Can you use lxml? The file extension is not a problem, since lxml can parse XML contained in a string. It is usually better to use a well-tested, widely used library than to write your own code. Commented May 14, 2021 at 11:24
  • @Wind56 I updated my explanation. Commented May 14, 2021 at 11:36
  • OK, thanks. I will look at your issue when I have more time. Also consider filing an issue to inform the maintainers that there are problems with encoding, if you are sure of what you say. Commented May 14, 2021 at 11:39

1 Answer

ElementTree does not keep parent references in the XML tree. That's inconvenient but not the end of the world.

But in order to delete any node in an XML document, you need to delete it from its parent, so you need a way to get the parent node.
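
One common way to get at parents, shown here only as a minimal sketch, is to build a child-to-parent dictionary once after parsing. The sample XML string and the name `parent_of` are made up for illustration and are not part of the original question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, used only for this illustration.
root = ET.fromstring('<tu><seg><bpt i="1">{</bpt>Cover page</seg></tu>')

# Map every child element to its parent. The map is a snapshot:
# it has to be rebuilt if the tree structure changes afterwards.
parent_of = {child: parent for parent in root.iter() for child in parent}

bpt = root.find('.//bpt')
parent_tag = parent_of[bpt].tag
print(parent_tag)  # seg
```

The map costs one full pass over the tree, so it pays off mainly when many lookups are needed.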

The easiest approach with ElementTree is to iterate over all potential parents and check whether each one has a child you want to delete.

Assuming <bpt> is always a child of <seg>, that would mean iterating the <seg> elements:

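# root is the document root, e.g. xml.etree.ElementTree.parse(path).getroot()
for node in root.iter('seg'):
    prev = None  # last sibling that was kept, if any
    for child in list(node):  # iterate over a copy, because children get removed
        if child.tag == 'bpt':
            # retain child node's tail, if any
            if child.tail is not None:
                if prev is None:
                    # no preceding sibling: the tail becomes part of the parent's text
                    node.text = (node.text if node.text else '') + child.tail
                else:
                    # otherwise append it to the preceding sibling's tail
                    prev.tail = (prev.tail if prev.tail else '') + child.tail
            node.remove(child)
        else:
            prev = child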

If <bpt> could be anywhere, change the outer loop above to for node in root.iter(): so that it visits every node in the document.
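
The generalized variant could be wrapped in a helper along these lines. This is a sketch, not a definitive implementation: the function name `remove_keeping_tail` and the sample string are assumptions, and the element list is copied up front so that removals cannot disturb the traversal.

```python
import xml.etree.ElementTree as ET

def remove_keeping_tail(root, tag):
    """Remove every <tag> element below root, keeping the text that follows it."""
    # Snapshot the elements first so the tree can be modified safely.
    for parent in list(root.iter()):
        prev = None  # last sibling that was kept
        for child in list(parent):
            if child.tag == tag:
                if child.tail:
                    if prev is None:
                        parent.text = (parent.text or '') + child.tail
                    else:
                        prev.tail = (prev.tail or '') + child.tail
                parent.remove(child)
            else:
                prev = child

# Hypothetical sample, shortened from the kind of segment discussed here.
root = ET.fromstring(
    '<tu><seg><bpt i="1">{</bpt>Cover page <ept i="1">}</ept></seg></tu>'
)
remove_keeping_tail(root, 'bpt')
result = ET.tostring(root, encoding='unicode')
print(result)  # <tu><seg>Cover page <ept i="1">}</ept></seg></tu>
```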

Explanation

ElementTree divides the document tree in its own idiosyncratic way. One main drawback is that there are no "parent" references, so relative navigation between nodes is very limited in general. Another is that there are no text nodes.

Instead of being a stand-alone node, any text after an element (i.e. text directly following the closing </tag>) becomes a property of that element, called .tail:

<!-- <bpt> elements and their "tails" -->

<seg><bpt i="1">{\\f3 </bpt>Cover page <ept i="1">}</ept><bpt i="2">{\\f2 </bpt>U1 - Insert graphic<ept i="2">}</ept></seg>
<!-- -----------------------^^^^^^^^^^^                  -----------------------^^^^^^^^^^^^^^^^^^^                     -->
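
To see how ElementTree assigns text, it can help to inspect .text and .tail directly. The snippet below is a small illustration on a made-up segment (the leading "Intro " is added only to show the parent's own text).

```python
import xml.etree.ElementTree as ET

# Made-up segment: "Intro " is the parent's text, "Cover page " is bpt's tail.
seg = ET.fromstring(
    '<seg>Intro <bpt i="1">{</bpt>Cover page <ept i="1">}</ept></seg>'
)
bpt = seg.find('bpt')
ept = seg.find('ept')

print(repr(seg.text))  # 'Intro '       text before the first child
print(repr(bpt.text))  # '{'            text inside <bpt>
print(repr(bpt.tail))  # 'Cover page '  text after </bpt>
print(repr(ept.tail))  # None           nothing follows </ept>
```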

Consequently, if we remove the <bpt> element, the tail is lost, too. In order to save it, we must add the content to the preceding element's tail (as with "U1 - Insert graphic", which now belongs to the <ept>), or if there is no preceding element, to the parent element's text (as with "Cover page ", which now belongs to the <seg>):

<!-- <bpt> elements removed, "tails" moved one to the front -->

<seg>Cover page <ept i="1">}</ept>U1 - Insert graphic<ept i="2">}</ept></seg>
<!-- ^^^^^^^^^^^                  ^^^^^^^^^^^^^^^^^^^                     -->

Repeating the same removal process with <ept> would lead to the following. All "tails" are now merged into <seg>'s text:

<seg>Cover page U1 - Insert graphic</seg>
<!-- ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   -->
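
If the goal is only to read the segment text, a non-destructive alternative (not part of the code above) is to join the parent's text with the tails of its children. This is a sketch that assumes all inline tags are direct children of <seg>; the helper name `segment_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

def segment_text(seg):
    """Return seg's own text plus the tail of each direct child, in order."""
    parts = [seg.text or '']
    parts.extend(child.tail or '' for child in seg)
    return ''.join(parts)

# Raw string so the backslashes from the sample segment are kept as-is.
seg = ET.fromstring(
    r'<seg><bpt i="1">{\\f3 </bpt>Cover page <ept i="1">}</ept>'
    r'<bpt i="2">{\\f2 </bpt>U1 - Insert graphic<ept i="2">}</ept></seg>'
)
text = segment_text(seg)
print(text)  # Cover page U1 - Insert graphic
```

The tree is left untouched, so the original markup can still be written back out unchanged.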

8 Comments

@StarryNight See updated code, explanation follows.
Yes, it fixed the error. But now it does not delete the second bpt tag in the .tmx file, although it does delete it in the string. :(
@StarryNight It removes the second <bpt> node for me. Look closely, your code must be different.
I think it is the same. I uploaded the code and the result above. Thank you again for still being here and trying to help :)
I tried for node in root.iter(): as well. It gives me the same result :(