
I have a large number of .xml files (about 70) and I need to extract some co-ordinates from them. Apparently the best way to do this is to parse the XML files using ElementTree. I am new to Python (very, very new!) and am having a difficult time understanding the documentation that comes with ElementTree. I was wondering if anyone had any code where they have used ElementTree, or if anyone could explain to me how to go about it. Thank you!

This is a sample from my XML file:

    <?xml version="1.0" encoding="UTF-8" ?> 
- <lev:Leveringsinformatie xmlns:lev="http://www.kadaster.nl/schemas/klic/20080722/leveringsinfo">
  <lev:Version>1.5</lev:Version> 
  <lev:Klicnummer>10G179900</lev:Klicnummer> 
  <lev:Ordernummer>0065491624</lev:Ordernummer> 
  <lev:RelatienummerGrondroerder>0000305605</lev:RelatienummerGrondroerder> 
  <lev:Leveringsvolgnummer>1</lev:Leveringsvolgnummer> 
  <lev:Meldingsoort>Graafmelding</lev:Meldingsoort> 
  <lev:DatumTijdAanvraag>2010-08-10T11:43:02.779+02:00</lev:DatumTijdAanvraag> 
  <lev:KlantReferentie>1207-0132-030 - 6</lev:KlantReferentie> 
- <lev:Locatie axisLabels="x y" srsDimension="2" srsName="epsg:28992" uomLabels="m m">
- <gml:exterior xmlns:gml="http://www.opengis.net/gml">
- <gml:LinearRing>
  <gml:posList>137800.0 484217.0 137796.0 484222.0 137832.0 483757.0 138178.0 483752.0 138174.0 484222.0 137800.0 484217.0</gml:posList> 
  </gml:LinearRing>
  </gml:exterior>
  </lev:Locatie>
- <lev:Pngformaat>
- <lev:OmsluitendeRechthoek xmlns:ns4="http://www.kadaster.nl/schemas/klic/20080722/madt" xmlns:bis="http://www.kadaster.nl/schemas/klic/20080722/klicnetbeheerdersinformatieservicetypes" xmlns:ns0="http://www.kadaster.nl/schemas/klic/20080722/gias" xmlns:ns7="http://www.kadaster.nl/schemas/klic/20080722/klicnetbeheerdersinformatieservicetypes" xmlns:madt="http://www.kadaster.nl/schemas/klic/20080722/madt" xmlns:gia="http://www.kadaster.nl/schemas/klic/20080722/gias" xmlns:klic="http://www.kadaster.nl/schemas/20080722/klic" xmlns:b="http://www.kadaster.nl/schemas/klic/20080722/bundelingtypes" xmlns:ns9="http://www.kadaster.nl/schemas/klic/20081010/bmkltypes" xmlns:gml="http://www.opengis.net/gml" xmlns:ns1="http://www.kadaster.nl/schemas/20080722/klic" xmlns:a="http://www.kadaster.nl/schemas/klic/20080722/bundelingservicetypes" xmlns:bmkl="http://www.kadaster.nl/schemas/klic/20081010/bmkltypes" xmlns:ns3="http://www.opengis.net/gml" xmlns:ns8="http://www.kadaster.nl/schemas/klic/20080722/knts">
- <gml:Envelope srsDimension="2" srsName="epsg:28992">
  <gml:lowerCorner>137796 483752</gml:lowerCorner> 
  <gml:upperCorner>138178 484222</gml:upperCorner> 
  </gml:Envelope>
  </lev:OmsluitendeRechthoek>
  <lev:PixelsBreed>5348</lev:PixelsBreed> 
  <lev:PixelsHoog>6580</lev:PixelsHoog> 
  </lev:Pngformaat>
- <lev:NetbeheerderLeveringen>
- <lev:NetbeheerderLevering>
  <lev:RelatienummerNetbeheerder>0000578695</lev:RelatienummerNetbeheerder> 
  <lev:Bedrijfsnaam>Gemeente Almere</lev:Bedrijfsnaam> 
  <lev:BedrijfsnaamAfkorting>Gemeente Almere</lev:BedrijfsnaamAfkorting> 

I need to extract the lower and upper corner co-ordinates (lowerCorner/upperCorner)

Update: Here is my full script:

from xml.etree import ElementTree as ET
import sys, string, os, arcgisscripting
gp = arcgisscripting.create(9.3)

workspace = "D:/J040083"
gp.workspace = workspace

for root, dirs, filenames in os.walk(workspace): # returns root, dirs, and files
    for filename in filenames:
        filename_split = os.path.splitext(filename) # filename and extensionname (extension in [1])
        filename_zero = filename_split[0]
        extension = str.upper(filename_split[1])

        try:
            first_2_letters = str.upper(filename_zero[0] + filename_zero[1])
        except:
            first_2_letters = "XX"

        if first_2_letters == "LI" and extension == ".XML":
            tree = ET.parse(workspace)
            print tree.find('//{http://www.opengis.net/gml}lowerCorner').text
            print tree.find('//{http://www.opengis.net/gml}upperCorner').text

I am now getting the error:

Message File Name Line Position
Traceback
D:\J040083\TXT_EXTRACTION.py 32
parse C:\Python25\Lib\xml\etree\ElementTree.py 862
parse C:\Python25\Lib\xml\etree\ElementTree.py 579
IOError: [Errno 13] Permission denied: 'D:/J040083'

and now I am REALLY confused because I am able to access these files with a different script which is almost exactly the same as this one!!

  • Just so we're all on the same page, have you read the ElementTree documentation? That's a reference document but there are examples sprinkled throughout the page. For an intro, the ElementTree Overview page might be helpful too. Commented Jan 18, 2011 at 10:12
  • Embarrassingly, yes, I have read it! I just don't really understand it. Commented Jan 18, 2011 at 10:34
  • @Alice: I suggest you post a small realistic snippet from an XML file you want to parse and specify the data you want to reach. You can do it by editing your own post. Commented Jan 18, 2011 at 10:52
  • I did try that but it just shows up in my question not in the correct format, so instead of having the comments it just had the numbers! Commented Jan 18, 2011 at 10:57
  • @Alice Duff - if you're going to be doing a lot of work with GML then I'd recommend reading up on XML. GML can get fairly complex and you'll be pleased you got the XML fundamentals sorted out. I can't recommend any tutorials as it's been a while since I've looked at them, but avoid W3Schools (NOT linked with W3, who actually write the spec!) as they're frequently inaccurate. This is the first result that isn't W3Schools: learn-xml-tutorial.com Commented Jan 18, 2011 at 11:42

2 Answers

13

ElementTree can be tricky when namespaces are involved. The elements you are looking for are named <gml:lowerCorner> and <gml:upperCorner>. Searching higher in the XML data, gml is defined as an XML namespace: xmlns:gml="http://www.opengis.net/gml". The way to find a subelement of the XML tree is as follows:

from xml.etree import ElementTree as ET
tree = ET.parse('file.xml')
# './/' searches every level below the root; print() with one argument works on Python 2 and 3
print(tree.find('.//{http://www.opengis.net/gml}lowerCorner').text)
print(tree.find('.//{http://www.opengis.net/gml}upperCorner').text)

Output

137796 483752
138178 484222

Explanation

ElementTree supports a subset of XPath, in which // selects subelements on all levels of the tree, and find returns the first match. ElementTree uses {url}tag notation for a tag in a specific namespace; the URL for gml is http://www.opengis.net/gml. .text retrieves the text content of the element.
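Since .text comes back as one string, a small helper can turn it into numbers. This is only a minimal sketch, assuming each corner always holds exactly two whitespace-separated values. parse_corner is a hypothetical helper name (not part of ElementTree), and reading the values as x then y is an assumption based on axisLabels="x y" on lev:Locatie in the sample.

```python
def parse_corner(text):
    """Split a corner string such as '137796 483752' into a tuple of two floats."""
    # Assumption: the first value is x and the second is y.
    x, y = text.split()
    return float(x), float(y)
```

It could be called as parse_corner(tree.find('.//{http://www.opengis.net/gml}lowerCorner').text).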

Note that // is a shortcut to finding a nested node. The full path of upperCorner in ElementTree's syntax is actually:

{http://www.kadaster.nl/schemas/klic/20080722/leveringsinfo}Pngformaat/{http://www.kadaster.nl/schemas/klic/20080722/leveringsinfo}OmsluitendeRechthoek/{http://www.opengis.net/gml}Envelope/{http://www.opengis.net/gml}upperCorner
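Regarding the updated script in the question: ET.parse(workspace) is handed the directory D:/J040083 rather than a file inside it, and trying to open a directory is what raises IOError: [Errno 13] Permission denied on Windows. Below is a rough sketch of the loop with the full file path passed to ET.parse instead. It is written for Python 3 and leaves out the arcgisscripting parts; find_corners is a hypothetical helper name, and the "LI" prefix and ".XML" extension checks are copied from the question.

```python
import os
from xml.etree import ElementTree as ET

GML = "{http://www.opengis.net/gml}"  # namespace URI taken from the sample file


def find_corners(workspace):
    """Yield (path, lowerCorner text, upperCorner text) for each LI*.xml file."""
    for root, dirs, filenames in os.walk(workspace):
        for filename in filenames:
            name, extension = os.path.splitext(filename)
            if name.upper().startswith("LI") and extension.upper() == ".XML":
                # Parse the file itself, not the workspace directory.
                path = os.path.join(root, filename)
                tree = ET.parse(path)
                lower = tree.find(".//" + GML + "lowerCorner")
                upper = tree.find(".//" + GML + "upperCorner")
                # Skip files that do not contain both corners.
                if lower is not None and upper is not None:
                    yield path, lower.text, upper.text
```

Looping over find_corners("D:/J040083") and printing each tuple should then give one line per matching file.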

2

Using ElementTree is simple: you create an object parsed from a file, find elements by name or path, and read their text or attributes.

In your case it's a bit more complicated because you have namespaces in your file, so we have to transform the path from the form ns:tag to the form {uri}tag. This is the aim of the transform_path function.

import xml.etree.ElementTree as ET

NS_MAP = {
    'http://www.kadaster.nl/schemas/klic/20080722/leveringsinfo': 'lev',
    'http://www.opengis.net/gml': 'gml',
}
# Inverse map: prefix -> URI
INV_NS_MAP = {v: k for k, v in NS_MAP.items()}
# For Python 2 before 2.7: INV_NS_MAP = dict((v, k) for k, v in NS_MAP.iteritems())

# ElementTree expects tags in the form {uri}tag, but it would be a pain to write the complete URI for each tag
def transform_path(path):
    steps = []
    for step in path.split('/'):
        ns, tag = step.split(':')
        steps.append("{" + INV_NS_MAP[ns] + "}" + tag)
    # Join without a trailing '/', which ElementTree would read as "all children"
    return '/'.join(steps)

tree = ET.parse('test.xml')
doc = tree.getroot()

lowerCorner = doc.find(transform_path("lev:Pngformaat/lev:OmsluitendeRechthoek/gml:Envelope/gml:lowerCorner"))
upperCorner = doc.find(transform_path("lev:Pngformaat/lev:OmsluitendeRechthoek/gml:Envelope/gml:upperCorner"))
print(lowerCorner.text)  # Print coordinates
print(upperCorner.text)  # Print coordinates

# print(...) with a single argument also works on Python 2

Running the script with your file gives the following output:

137796 483752
138178 484222
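As an alternative to translating the path by hand, find in Python 2.7 and 3 accepts an optional namespaces mapping from prefix to URI, so the prefixed path can be passed directly. This is a hedged sketch rather than a drop-in replacement: read_corners and NAMESPACES are hypothetical names, and the path assumes the element layout shown in the sample from the question.

```python
import xml.etree.ElementTree as ET

# Prefix -> URI; the prefixes only need to match the ones used in the paths below.
NAMESPACES = {
    "lev": "http://www.kadaster.nl/schemas/klic/20080722/leveringsinfo",
    "gml": "http://www.opengis.net/gml",
}


def read_corners(root):
    """Return the (lowerCorner, upperCorner) text from a Leveringsinformatie root element."""
    envelope = root.find(
        "lev:Pngformaat/lev:OmsluitendeRechthoek/gml:Envelope", NAMESPACES
    )
    lower = envelope.find("gml:lowerCorner", NAMESPACES)
    upper = envelope.find("gml:upperCorner", NAMESPACES)
    return lower.text, upper.text
```

For a file on disk this could be called as read_corners(ET.parse('test.xml').getroot()).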

4 Comments

Thanks Charles, I am trying to run your code but it keeps giving me the error "Invalid Syntax" for the final line!
I'm having some trouble making this script work. Now I get an "Invalid Syntax" error for the second from last line?
I think it should work, I just don't understand how to make it work with my data. I will try doing some research and hopefully I will understand!
I made a small script that reads the coordinates of your file.
