16

I am new to programming in python,´and i have some troubles understanding the concept. I wish to compare two xml files. These xml files are quite large. I will give an example for the type of files i wish to compare.

xmlfile1:

<xml>
    <property1>
          <property2>    
               <property3>

               </property3>
          </property2>    
    </property1>    
</xml>

xml file2:

<xml>
    <property1>
        <property2>    
            <property3> 
                <property4>

                </property4>    
            </property3>
        </property2>    
    </property1>

</xml>

the property1,property2 that i have named are different from the ones that are actually in the file. There are a lot of properties within the xml file. ANd i wish to compare the two xml files.

I am using an lxml parser to try to compare the two files and to print out the difference between them.

I do not know how to parse it and compare it automatically.

I tried reading through the lxml parser, but i couldnt understand how to use it to my problem.

Can someone please tell me how should i proceed with this problem.

Code snippets can be very useful

One more question, Am i following the right concept or i am missing something else? Please correct me of any new concepts that you knwo about

2
  • What are you looking for in the output - if its just a difference you might want to use diff in linux or fc in windows Commented Jun 30, 2014 at 14:59
  • actually i want to know what part of the file has been changed. Commented Jun 30, 2014 at 19:45

4 Answers 4

12

This is actually a reasonably challenging problem (due to what "difference" means often being in the eye of the beholder here, as there will be semantically "equivalent" information that you probably don't want marked as differences).

You could try using xmldiff, which is based on work in the paper Change Detection in Hierarchically Structured Information.

Sign up to request clarification or add additional context in comments.

4 Comments

xmldiff is GPL. Does this mean I have to open source my source code, if I use it?
Necro-responding for reference: GPL means you will have to provide source code to a user, if requested from him/her to do so. It doesn't mean you have to make it public to everyone (nor free), and you can always put additional restrictions on the user with additional contracts.
@guettli Also, additional note. Since it is LGPL (strict GPL) all the code you will open-source with the usage of xmldiff will have to be under LGPL as well. If you don't plan to open-source your project, just use it as it is.
Also, worth mentioning that from xmldiff version 2.0 project was moved to MIT license which should save troubles license-wise.
7

My approach to the problem was transforming each XML into a xml.etree.ElementTree and iterating through each of the layers. I also included the functionality to ignore a list of attributes while doing the comparison.

The first block of code holds the class used:

import xml.etree.ElementTree as ET
import logging

class XmlTree():

    def __init__(self):
        self.hdlr = logging.FileHandler('xml-comparison.log')
        self.formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')

    @staticmethod
    def convert_string_to_tree( xmlString):

        return ET.fromstring(xmlString)

    def xml_compare(self, x1, x2, excludes=[]):
        """
        Compares two xml etrees
        :param x1: the first tree
        :param x2: the second tree
        :param excludes: list of string of attributes to exclude from comparison
        :return:
            True if both files match
        """

        if x1.tag != x2.tag:
            self.logger.debug('Tags do not match: %s and %s' % (x1.tag, x2.tag))
            return False
        for name, value in x1.attrib.items():
            if not name in excludes:
                if x2.attrib.get(name) != value:
                    self.logger.debug('Attributes do not match: %s=%r, %s=%r'
                                 % (name, value, name, x2.attrib.get(name)))
                    return False
        for name in x2.attrib.keys():
            if not name in excludes:
                if name not in x1.attrib:
                    self.logger.debug('x2 has an attribute x1 is missing: %s'
                                 % name)
                    return False
        if not self.text_compare(x1.text, x2.text):
            self.logger.debug('text: %r != %r' % (x1.text, x2.text))
            return False
        if not self.text_compare(x1.tail, x2.tail):
            self.logger.debug('tail: %r != %r' % (x1.tail, x2.tail))
            return False
        cl1 = x1.getchildren()
        cl2 = x2.getchildren()
        if len(cl1) != len(cl2):
            self.logger.debug('children length differs, %i != %i'
                         % (len(cl1), len(cl2)))
            return False
        i = 0
        for c1, c2 in zip(cl1, cl2):
            i += 1
            if not c1.tag in excludes:
                if not self.xml_compare(c1, c2, excludes):
                    self.logger.debug('children %i do not match: %s'
                                 % (i, c1.tag))
                    return False
        return True

    def text_compare(self, t1, t2):
        """
        Compare two text strings
        :param t1: text one
        :param t2: text two
        :return:
            True if a match
        """
        if not t1 and not t2:
            return True
        if t1 == '*' or t2 == '*':
            return True
        return (t1 or '').strip() == (t2 or '').strip()

The second block of code holds a couple of XML examples and their comparison:

xml1 = "<note><to>Tove</to><from>Jani</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"

xml2 = "<note><to>Tove</to><from>Daniel</from><heading>Reminder</heading><body>Don't forget me this weekend!</body></note>"

tree1 = XmlTree.convert_string_to_tree(xml1)
tree2 = XmlTree.convert_string_to_tree(xml2)

comparator = XmlTree()

if comparator.xml_compare(tree1, tree2, ["from"]):
    print "XMLs match"
else:
    print "XMLs don't match"

Most of the credit for this code must be given to syawar

2 Comments

this code needs tweaks to work on Python3.5 as below: def __init__(self): self.logger = logging.getLogger('xml_compare') self.logger.setLevel(logging.DEBUG) self.hdlr = logging.FileHandler('xml-comparison.log') self.formatter = logging.Formatter('%(asctime)s - %(levelname)s- %(message)s') self.hdlr.setLevel(logging.DEBUG) self.hdlr.setFormatter(self.formatter) self.logger.addHandler(self.hdlr)
Maybe i missed something, but this method as soon as it found a difference stop checking the XML, it's a good way to start, but it does not work as expected...
7

If your intent is to compare the XML content and attributes, and not just compare the files byte-by-byte, there are subtleties to the question, so there is no solution that fits all cases.

You have to know something about what is important in the XML files.

The order of attributes listed in an element tag is generally not supposed to matter. That is, two XML files that differ only in the order of element attributes generally ought to be judged the same.

But that's the generic part.

The tricky part is application-dependent. For instance, it may be that white-space formatting of some elements of the file doesn't matter, and white-space might be added to the XML for legibility. And so on.

Recent versions of the ElementTree module have a function canonicalize(), which can take care of simpler cases, by putting the XML string into a canonical format.

I used this function in the unit tests of a recent project, to compare a known XML output with output from a package that sometimes changes the order of attributes. In this case, white space in the text elements was unimportant, but it was sometimes used for formatting.

import xml.etree.ElementTree as ET
def _canonicalize_XML( xml_str ):
    """ Canonicalizes XML strings, so they are safe to 
        compare directly. 
        Strips white space from text content."""

    if not hasattr( ET, "canonicalize" ):
        raise Exception( "ElementTree missing canonicalize()" )

    root = ET.fromstring( xml_str )
    rootstr = ET.tostring( root )
    return ET.canonicalize( rootstr, strip_text=True )

To use it, something like this:

file1 = ET.parse('file1.xml')
file2 = ET.parse('file2.xml')

canon1 = _canonicalize_XML( ET.tostring( file1.getroot() ) )
canon2 = _canonicalize_XML( ET.tostring( file2.getroot() ) )

print( canon1 == canon2 )

In my distribution, the Python 2 doesn't have canonicalize(), but Python 3 does.

1 Comment

This is a great answer and I think it should be the accepted one in 2023.
1

Another script using xml.etree. Its awful but it works :)

#!/usr/bin/env python

import sys
import xml.etree.ElementTree as ET

from termcolor import colored

tree1 = ET.parse(sys.argv[1])
root1 = tree1.getroot()

tree2 = ET.parse(sys.argv[2])
root2 = tree2.getroot()

class Element:
    def __init__(self,e):
        self.name = e.tag
        self.subs = {}
        self.atts = {}
        for child in e:
            self.subs[child.tag] = Element(child)

        for att in e.attrib.keys():
            self.atts[att] = e.attrib[att]

        print "name: %s, len(subs) = %d, len(atts) = %d" % ( self.name, len(self.subs), len(self.atts) )

    def compare(self,el):
        if self.name!=el.name:
            raise RuntimeError("Two names are not the same")
        print "----------------------------------------------------------------"
        print self.name
        print "----------------------------------------------------------------"
        for att in self.atts.keys():
            v1 = self.atts[att]
            if att not in el.atts.keys():
                v2 = '[NA]'
                color = 'yellow'
            else:
                v2 = el.atts[att]
                if v2==v1:
                    color = 'green'
                else:
                    color = 'red'
            print colored("first:\t%s = %s" % ( att, v1 ), color)
            print colored("second:\t%s = %s" % ( att, v2 ), color)

        for subName in self.subs.keys():
            if subName not in el.subs.keys():
                print colored("first:\thas got %s" % ( subName), 'purple')
                print colored("second:\thasn't got %s" % ( subName), 'purple')
            else:
                self.subs[subName].compare( el.subs[subName] )



e1 = Element(root1)
e2 = Element(root2)

e1.compare(e2)

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.