
Currently I’m working on a corpus/dataset in XML format, as shown in the picture below. I want to access all ‘ne’ elements one by one, then access the text of the ‘W’ elements inside each ‘ne’ element, and then concatenate the symbols ‘SDi’ and ‘EDi’ with that text, where ‘i’ is a positive whole number starting from 1. For ‘SDi’ I need only the text of the first ‘W’ element inside the ‘ne’ element; for ‘EDi’ I need only the text of the last ‘W’ element inside it.

Currently I get no output after running the code. I think this is because the ‘W’ element is never accessed: it is a grandchild of the ‘ne’ element, so perhaps it can't be accessed directly and has to be reached through its parent node.

Note 1: The number and names of the sub-elements inside ‘ne’ elements are not always the same.

Note 2: Only the necessary parts are explained here. The code and picture contain some other details; please ignore them.

I'm using Spyder (Python 3.6). Any help would be appreciated.

A picture from the XML file I'm working on is given below: enter image description here

Text version of XML file: Click here

Sample/Expected output image (below): enter image description here

Coding I've done so far:

for i in range(len(List_of_root_nodes)):
    true_false = True
    current = List_of_root_nodes[i]
    start_ID = current.PDante_ID
    #print('start:', start_ID)  # For Testing
    end_ID = None
    number = str(i+1)  # This number will serve as i used with SD and ED that is (SDi and EDi)

    discourse_starting_symbol = "SD" + number
    discourse_ending_symbol = "ED" + number

    while true_false:
        if current.right_child is None:
            end_ID = current.PDante_ID
            #print('end:', end_ID)  # For Testing
            true_false = False
        else:
            current = current.right_child

    # Finding 'ne' element with id='start_ID'
    ne_text = None
    ne_id = None

    for ne in myroot.iter('ne'):
        ne_id = ne.get('id')

        # If ne_id matches with start_ID means the place where SDi is to be placed is found
        if ne_id == start_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = " " + str(discourse_starting_symbol) + " " + ne_text
                w.text = boundary_and_text
                break

        # If ne_id matches with end_ID means the place where EDi is to be placed is found

        # Some changes Required here: Here the 'EDi' will need to be placed after the last 'W' element.
        # So last 'W' element needs to be accessed
        if ne_id == end_ID:
            for w in ne.iter('W'):
                ne_text = str(w.text)
                boundary_and_text = ne_text + " " + str(discourse_ending_symbol) + " "
                w.text = boundary_and_text
                break
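
A quick way to test the grandchild hypothesis: in `xml.etree.ElementTree`, `Element.iter('W')` visits descendants at any depth, while `findall('W')` with a bare tag name looks only at direct children. The sketch below uses a made-up XML sample (the element names `doc`, `np` and `pp` are assumptions, not taken from the real corpus) to show that `iter` does reach grandchildren, and that the last 'W' can be picked from the collected list.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <W> is a grandchild of <ne>, not a direct child.
SAMPLE = """
<doc>
  <ne id="n1">
    <np><W>first</W><W>middle</W></np>
    <pp><W>last</W></pp>
  </ne>
</doc>
"""

root = ET.fromstring(SAMPLE)
ne = root.find('.//ne')

# iter() walks all descendants in document order, at any depth.
words = list(ne.iter('W'))
first_w, last_w = words[0], words[-1]

# A direct-child lookup finds nothing here, because <W> sits one level deeper.
direct_children = ne.findall('W')

print(first_w.text, last_w.text, len(direct_children))
```

If `iter('W')` already reaches the elements, an empty output is more likely caused by the `id` comparison never matching or by the modified tree never being printed or written out; that is only a guess without the full data.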
  • Could you post a text version of your xml snippet or a link to it for testing? A sample of your expected output would also be helpful. Commented Oct 18, 2019 at 18:02
  • I've edited the post as per requirements so that you may help me. @ColeTierney Commented Oct 18, 2019 at 18:31
  • You should not post code as an image because:... And avoid us having to download your data. Embed a small sample (like your screenshots) as text in body of posts that can serve future readers should links go dead. Commented Oct 18, 2019 at 19:07
  • I appreciate what you said, but respectfully, @Parfait, I don't think I've posted my code (under the title 'Coding I've done so far') as an image. Commented Oct 18, 2019 at 19:15

2 Answers


Something like this (a.xml is the XML you have uploaded):

Note that the code does not use any external library.

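import xml.etree.ElementTree as ET

SD = 'SD'
ED = 'ED'

# parse() returns an ElementTree object; findall() and dump() both accept it
root = ET.parse('a.xml')

counter = 1

for ne in root.findall('.//ne'):
    # all <W> descendants of this <ne>, in document order
    w_lst = ne.findall('.//W')
    if w_lst:
        # prefix the text of the first <W> with SDi
        w_lst[0].text = '{}{} {}'.format(SD, counter, w_lst[0].text)
        if len(w_lst) > 1:
            # append EDi to the text of the last <W>
            w_lst[-1].text = '{} {}{}'.format(w_lst[-1].text, ED, counter)
        counter += 1
# print the modified XML to standard output
ET.dump(root)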
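
One possible variation, sketched under assumptions: if an 'ne' element with a single 'W' should receive both markers, and a 'W' without text should not produce the string 'None', the loop could be written as below. The function name `mark_boundaries` and the inline sample are made up for illustration and are not part of the uploaded file.

```python
import xml.etree.ElementTree as ET

def mark_boundaries(root, sd='SD', ed='ED'):
    """Prefix the first <W> of each <ne> with SDi and suffix the last with EDi."""
    counter = 1
    for ne in root.iter('ne'):
        w_lst = list(ne.iter('W'))
        if not w_lst:
            continue  # no <W> below this <ne>, so no number is consumed
        first, last = w_lst[0], w_lst[-1]
        # "or ''" avoids formatting a missing text as the string 'None'.
        first.text = '{}{} {}'.format(sd, counter, first.text or '')
        # With a single <W>, first and last are the same element,
        # so it receives both markers.
        last.text = '{} {}{}'.format(last.text or '', ed, counter)
        counter += 1
    return root

# Hypothetical sample data, not the real corpus.
SAMPLE = """
<doc>
  <ne id="a"><np><W>New</W><W>York</W></np></ne>
  <ne id="b"><W>Paris</W></ne>
  <ne id="c"/>
</doc>
"""

root = mark_boundaries(ET.fromstring(SAMPLE))
texts = [w.text for w in root.iter('W')]
print(texts)
```

Whether a single 'W' should carry both markers depends on the expected output, so treat that choice as an assumption.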

4 Comments

Can you please explain the code you posted a little, @balderman?
Sure. The code loops over all ne elements. For each ne it finds the W elements. If there is at least one W element (under ne), it sets the SD value on the first one. If there is more than one, it goes to the last W and sets the ED value. Does it work for you?
I understood 100% of what you said, but as I'm almost a beginner in coding, I don't understand exactly how each line works. Can you please tell me a little more about the three braces, that is '{}{} {}'? What are they doing here? Are they providing some free space? Could you also tell me about the function 'format()': what does it do with the three arguments passed to it? Does it combine them? And finally, why is there ET.dump(root)? Is it because we've made changes to the XML and it now needs to be built or written again, or something like that?
Last time I checked your solution I couldn't understand it, so I haven't yet checked whether it works for me. After thinking it over for a while, I now think this is what I wanted, with some changes required. I'll let you know if it works for me and will also mark your solution as 'it worked'. Thank you.
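
On the `format()` and `ET.dump` questions above, a minimal sketch using only the standard library may help. The values 'SD', 1 and 'word' are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Each {} is a placeholder that format() fills with its arguments, in order.
# Characters between the braces (here, one space) are copied unchanged.
label = '{}{} {}'.format('SD', 1, 'word')
print(label)

# ET.dump() writes the serialized element to standard output, which is a
# quick way to inspect an in-memory edit. Nothing is written to a file.
elem = ET.fromstring('<W>word</W>')
elem.text = label
ET.dump(elem)

# The same serialization as a string, for comparison.
serialized = ET.tostring(elem, encoding='unicode')
```

Saving the modified tree to disk is a separate step, for example with the `write()` method of an `ElementTree` object.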

Whenever you need to modify XML with various nuanced needs, consider XSLT, the special-purpose language designed to transform XML files. You can run XSLT 1.0 scripts with Python's third-party module, lxml (not built-in etree).

Specifically, call the identity transform to copy the XML as is, and then run the two templates to add SDI to the text of the very first <W> and EDI to the text of the very last <W>. The solution will work whether there are 10 or 10,000 <W> nodes, deeply nested or not.

To demonstrate with example data of StackOverflow's top Python and XSLT users, see online demo where SDI and EDI are added to first and last <user> node:

XSLT (save as .xsl file, a special .xml file to be loaded in Python)

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- IDENTITY TRANSFORM -->    
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- EDIT FIRST W NODE -->    
  <xsl:template match="W[count(preceding::W)=0]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat('SDI ', text())"/>
    </xsl:copy>
  </xsl:template>

  <!-- EDIT LAST W NODE -->    
  <xsl:template match="W[count(preceding::W)+1 = count(//W)]">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:value-of select="concat('EDI ', text())"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Python (no loops or if/else logic)

import lxml.etree as et

doc = et.parse('/path/to/Input.xml')
xsl = et.parse('/path/to/Script.xsl')

# CONFIGURE TRANSFORMER
transform = et.XSLT(xsl)    

# TRANSFORM SOURCE DOC
result = transform(doc)

# OUTPUT TO CONSOLE
print(result)

# SAVE TO FILE
with open('Output.xml', 'wb') as f:
    f.write(result)
