How to modify the text of nested elements in xml file using python?

Question

I’m currently working on a corpus/dataset in XML format, as you can see in the picture below. I want to access all ‘ne’ elements one by one, as shown in the picture, and then access the text of the ‘W’ elements inside each ‘ne’ element. I then want to concatenate the symbols ‘SDi’ and ‘EDi’ with the text of these ‘W’ elements, where ‘i’ is a positive whole number starting from 1. For ‘SDi’ I need only the text of the first ‘W’ element inside the ‘ne’ element. For ‘EDi’ I need only the text of the last ‘W’ element inside the ‘ne’ element. Currently I get no output after running the code. I think this is because the ‘W’ element is never accessed. Moreover, I think ‘W’ is not accessed because it is a grandchild of ‘ne’, so it can't be reached directly, though it may be reachable through its parent node.

Note 1: The number and names of the sub-elements inside ‘ne’ elements are not the same.

Note 2: Only what is needed is explained here. You may find other details in the code/picture, but please ignore them.

I'm using Spyder (Python 3.6). Any help would be appreciated.

A picture from the XML file I'm working on is given below:

Text version of XML file: Click here

Sample/Expected output image (below):

Coding I've done so far:

for i in range(len(List_of_root_nodes)):
    true_false = True
    current = List_of_root_nodes[i]
    start_ID = current.PDante_ID
    #print('start:', start_ID)  # For Testing
    end_ID = None
    number = str(i+1)  # This number will serve as i used with SD and ED that is (SDi and EDi)

    discourse_starting_symbol = "SD" + number
    discourse_ending_symbol = "ED" + number

    while true_false:
        if current.right_child is None:
            end_ID = current.PDante_ID
            #print('end:', end_ID)  # For Testing
            true_false = False
        else:
            current = current.right_child

    # Finding 'ne' element with id='start_ID'
    ne_text = None
    ne_id = None

    for ne in myroot.iter('ne'):
        ne_id = ne.get('id')

        # If ne_id matches with start_ID means the place where SDi is to be placed is found
        if ne_id == start_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = " " + str(discourse_starting_symbol) + " " + ne_text
                w.text = boundary_and_text
                break

        # If ne_id matches with end_ID means the place where EDi is to be placed is found

        # Some changes Required here: Here the 'EDi' will need to be placed after the last 'W' element.
        # So last 'W' element needs to be accessed
        if ne_id == end_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = ne_text + " " + str(discourse_ending_symbol) + " "
                w.text = boundary_and_text
                break

Could you post a text version of your xml snippet or a link to it for testing? A sample of your expected output would also be helpful.
– Cole Tierney, Commented Oct 18, 2019 at 18:02
I've edited the post as per requirements so that you may help me. @ColeTierney
– Muhammad Aatif, Commented Oct 18, 2019 at 18:31
You should not post code as an image because:... And avoid us having to download your data. Embed a small sample (like your screenshots) as text in the body of posts, so it can serve future readers should links go dead.
– Parfait, Commented Oct 18, 2019 at 19:07
I appreciate what you said, but respectfully, @Parfait, I don't think I've posted my code (under the title 'Coding I've done so far') as an image.
– Muhammad Aatif, Commented Oct 18, 2019 at 19:15

balderman · Accepted Answer · 2019-10-19 07:28:49Z

1

Something like this (a.xml is the XML you have uploaded):

Note that the code does not use any external library.

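import xml.etree.ElementTree as ET

SD = 'SD'
ED = 'ED'

# ET.parse returns an ElementTree; findall() and ET.dump() both accept it
root = ET.parse('a.xml')

counter = 1

for ne in root.findall('.//ne'):
    # All <W> descendants of this <ne>, at any depth, in document order
    w_lst = ne.findall('.//W')
    if w_lst:
        # Put the start marker before the text of the first <W>
        w_lst[0].text = '{}{} {}'.format(SD, counter, w_lst[0].text)
        if len(w_lst) > 1:
            # Put the end marker after the text of the last <W>
            w_lst[-1].text = '{} {}{}'.format(w_lst[-1].text, ED, counter)
        counter += 1
# Print the modified XML to standard output (a.xml itself is not changed)
ET.dump(root)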
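
For testing without the uploaded file, the same idea can be tried on an inline snippet. This is a minimal sketch and not the code above: the sample XML, the helper name `mark_boundaries`, and the choice to put both markers on the same `W` when an `ne` contains only one are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real corpus has more attributes and nesting.
SAMPLE = """
<doc>
  <ne id="1"><x><W>alpha</W></x><W>beta</W><y><W>gamma</W></y></ne>
  <ne id="2"><z><W>delta</W></z></ne>
  <ne id="3"/>
</doc>
"""


def mark_boundaries(root):
    """Add SDi before the first W and EDi after the last W of each ne."""
    counter = 1
    for ne in root.iter('ne'):
        # iter() walks all descendants, so W grandchildren are found too
        w_list = list(ne.iter('W'))
        if not w_list:
            continue  # an ne without any W gets no markers and no number
        first, last = w_list[0], w_list[-1]
        first.text = 'SD{} {}'.format(counter, first.text or '')
        # With a single W, first and last are the same element
        last.text = '{} ED{}'.format(last.text or '', counter)
        counter += 1
    return root


root = mark_boundaries(ET.fromstring(SAMPLE))
marked = [w.text for w in root.iter('W')]
print(marked)  # ['SD1 alpha', 'beta', 'gamma ED1', 'SD2 delta ED2']
```

The sketch also shows that a `W` nested below a child of `ne` is reachable from `ne` directly. If `ne` elements can be nested inside each other, a `W` would be visited once per enclosing `ne`, which may or may not be what is wanted.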


4 Comments

Muhammad Aatif Over a year ago

Can you please explain the code you posted a little, @balderman?

balderman Over a year ago

Sure. The code loops over all ne elements. For each ne it finds the W elements. If there is at least one W element (under ne), it sets the SD value on the first one. If there is more than one, it goes to the last W and sets the ED value. Does it work for you?

Muhammad Aatif Over a year ago

I understood 100% of what you said, but as I'm almost a complete beginner in coding, I don't understand exactly how each line works. Can you please tell me a little more about the three braces, that is '{}{} {}'? What are they doing here? Are they providing some free space? Also, could you tell me about the function 'format()'? What does it do with the three arguments passed to it? It combines them, yes? And finally, why is there ET.dump(root)? Is it because we've made some changes to the XML file and it now needs to be built/written again, or something like that?
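
The points raised in this comment can be checked with a short, self-contained snippet. It is only an illustration: the `<W>` element with the text `word` is made up, and an in-memory buffer stands in for a real output file.

```python
import io
import xml.etree.ElementTree as ET

# Each {} is a placeholder; format() fills them left to right with its
# arguments. Characters between the braces (here one space) are kept as is.
label = '{}{} {}'.format('SD', 1, 'word')
print(label)  # SD1 word

# ET.dump only prints the XML to standard output for inspection; it does
# not save anything. To save the changes, write the tree out explicitly.
w = ET.fromstring('<W>word</W>')
w.text = label
buffer = io.BytesIO()  # stands in for a real file opened in 'wb' mode
ET.ElementTree(w).write(buffer)
saved = buffer.getvalue()
print(saved)  # b'<W>SD1 word</W>'
```

With a tree loaded by `ET.parse`, calling its `write` method with a file name saves the modified XML to disk in the same way.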

Muhammad Aatif Over a year ago

Last time I checked your posted solution I couldn't understand it, so I haven't checked whether it works for me. But after thinking it over for a while, I now think this is what I wanted, with more or fewer changes required. I'll let you know if it works for me and will also mark your solution as 'it worked'. Thank you.

Parfait · Accepted Answer · 2019-10-18 21:35:17Z

Whenever you need to modify XML with various nuanced needs, consider XSLT, the special-purpose language designed to transform XML files. You can run XSLT 1.0 scripts with Python's third-party module, lxml (not built-in etree).

Specifically, call the identity transform to copy the XML as is, then run two templates: one that adds SDI to the text of the very first <W> and one that adds EDI to the text of the very last <W>. The solution works whether there are 10 or 10,000 <W> nodes, deeply nested or not.

To demonstrate with example data of Stack Overflow's top Python and XSLT users, see the online demo, where SDI and EDI are added to the first and last <user> node:

XSLT (save as .xsl file, a special .xml file to be loaded in Python)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- IDENTITY TRANSFORM -->    
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- EDIT FIRST W NODE -->    
  <xsl:template match="W[count(preceding::W)=0]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat('SDI ', text())"/>
    </xsl:copy>
  </xsl:template>

  <!-- EDIT LAST W NODE -->    
  <xsl:template match="W[count(preceding::W)+1 = count(//W)]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat('EDI ', text())"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Python (no loops or if/else logic)

import lxml.etree as et

doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/Script.xsl')

# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)    

# TRANSFORM SOURCE DOC
result = transform(doc)

# OUTPUT TO CONSOLE
print(result)

# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
