2

I have an XML file which contains the following strings:

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>

In the body of the XML, I have > and < characters, which are not compatible with the XML specification. I need to replace them such that when > and < are in:

 ' "> ' 
 ' " > ' and 
 ' </ ' 

respectively, they should NOT be replaced, all other occurrence of > and < should be replaced by strings "greater than" and "less than". So the result should be like:

 <field name="id">abcdef</field>
 <field name="intro" > pqrst</field>
 <field name="desc"> this is a test file. We will show 5 greater than 2 and 3 less than 5 and try to remove non xml compatible characters.</field>

How can I do that with Python?

5
  • 1
    Better to provide the result, rather then describe it. Commented Nov 10, 2012 at 3:51
  • You could try parsing the entire file using the python regexp module here docs.python.org/2/library/re.html Are all the improper uses of < and > in your file in the case of numerical expressions? If so this should be pretty easy, just replace "# > #", "#> #", and "# >#" with "# is greater than #" and "# < #", "#< #", and "# <#" with "# is less than #" Commented Nov 10, 2012 at 8:15
  • No, they are not all numerical. Basically the problem is I can not come up with a suitable regexp. Commented Nov 10, 2012 at 9:14
  • Your constraints would replace the '<'s at the beginning of each line as they don't fall into any of the 3 cases you provided where they should not be substituted. It might be easier to provide for the cases in which they are subsituted. Commented Nov 10, 2012 at 10:38
  • ok, so what you're saying is that these may not all be numerical comparisons, but they are all comparisons by value? I assume you wouldn't want to translate '>' and '<' to 'greater than' and 'less than' in cases of stream redirection Commented Nov 10, 2012 at 12:11

3 Answers 3

2

You could use lxml.etree.XMLParser with recover=True option:

import sys
from lxml import etree

invalid_xml = """
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and
try to remove non xml compatible characters.</field>
"""
root = etree.fromstring("<root>%s</root>" % invalid_xml,
                        parser=etree.XMLParser(recover=True))
root.getroottree().write(sys.stdout)

Output

<root>
<field name="id">abcdef</field>
<field name="intro"> pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 35 and
try to remove non xml compatible characters.</field>
</root>

Note: > is left in the document as &gt; and < is completely removed (as invalid character in xml text).

Regex-based solution

For simple xml-like content you could use re.split() to separate tags from the text and make the substitutions in non-tag text regions:

import re
from itertools import izip_longest
from xml.sax.saxutils import escape  # '<' -> '&lt;'

# assumptions:
#   doc = *( start_tag / end_tag / text )
#   start_tag = '<' name *attr [ '/' ] '>'
#   end_tag = '<' '/' name '>'
ws = r'[ \t\r\n]*'  # allow ws between any token
name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
tag = '{start_tag} | {end_tag}'

assert '{{' not in tag
while '{' in tag: # unwrap definitions
    tag = tag.format(**vars())

tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

# escape &, <, > in the text
iters = [iter(tag_regex.split(invalid_xml))] * 2
pairs = izip_longest(*iters, fillvalue='')  # iterate 2 items at a time
print(''.join(escape(text) + tag for text, tag in pairs))

To avoid false positives for tags you could remove some of '{ws}' above.

Output

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and
try to remove non xml compatible characters.</field>

Note: both <> are escaped in the text.

You could call any function instead of escape(text) above e.g.,

def escape4human(text):
    return text.replace('<', 'less than').replace('>', 'greater than')
Sign up to request clarification or add additional context in comments.

2 Comments

lxml-based solution: 3<5 becomes 35
@adray: yes. < is invalid in xml text so the xml parser can't parse it properly and recover=True option allows the parser to skip it.
2

Seems I did it for >:

re.sub('(?<! " )(?<! ")(?! )>','greater than', xml_string)

?<! - negative lookbehind assertion,

?! - negative lookahead assertion,

(...)(...) is logical AND,

so whole expression means "substitute all occurences of '>' which (does not start with ' " ') and (does not start with ' "') and ( does not end with ' ')

case < is similar

Comments

-3

Use ElementTree for XML parsing.

3 Comments

The ElementTree throws an exception because the XML has supposedly misplaced > and < characters.
@Mr_Spock: the point is that the XML is malformed, so ElementTree won't handle it. I just tried the lxml.html variant which can handle some malformed XML as well, but it too fails here.
I wonder why I received a downvote then. The guy's question wasn't even clear then. He didn't state what he was using until AFTER I brought up ElementTree. Seems a little unfair. I won't fret though.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.