Matching patterns in Python

Question

I have an XML file which contains the following strings:

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and try to remove non xml compatible characters.</field>

In the body of the XML, I have > and < characters, which are not compatible with the XML specification. I need to replace them such that when > and < are in:

 ' "> ' 
 ' " > ' and 
 ' </ '

respectively, they should NOT be replaced. All other occurrences of > and < should be replaced by the strings "greater than" and "less than". So the result should look like:

 <field name="id">abcdef</field>
 <field name="intro" > pqrst</field>
 <field name="desc"> this is a test file. We will show 5 greater than 2 and 3 less than 5 and try to remove non xml compatible characters.</field>

How can I do that with Python?

You could try parsing the entire file with Python's regexp module (docs.python.org/2/library/re.html). Are all the improper uses of < and > in your file numerical expressions? If so, this should be pretty easy: just replace "# > #", "#> #", and "# >#" with "# is greater than #", and "# < #", "#< #", and "# <#" with "# is less than #".
– Hart Simha, Commented Nov 10, 2012 at 8:15
No, they are not all numerical. Basically the problem is that I cannot come up with a suitable regexp.
– rivu, Commented Nov 10, 2012 at 9:14
Your constraints would replace the '<'s at the beginning of each line, as they don't fall into any of the 3 cases you provided where they should not be substituted. It might be easier to specify the cases in which they are substituted.
– Hart Simha, Commented Nov 10, 2012 at 10:38
OK, so what you're saying is that these may not all be numerical comparisons, but they are all comparisons by value? I assume you wouldn't want to translate '>' and '<' to 'greater than' and 'less than' in cases of stream redirection.
– Hart Simha, Commented Nov 10, 2012 at 12:11

jfs · Accepted Answer · 2012-11-11 23:36:27Z

You could use lxml.etree.XMLParser with the recover=True option:

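from lxml import etree

invalid_xml = """
<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5>2 and 3<5 and
try to remove non xml compatible characters.</field>
"""
# Wrap the fragments in a single root element; recover=True lets the
# parser skip what it cannot parse instead of raising an error.
root = etree.fromstring("<root>%s</root>" % invalid_xml,
                        parser=etree.XMLParser(recover=True))
# Serialize to text so that printing also works on Python 3.
print(etree.tostring(root, encoding="unicode"))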

Output

<root>
<field name="id">abcdef</field>
<field name="intro"> pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 35 and
try to remove non xml compatible characters.</field>
</root>

Note: > is left in the document as &gt; and < is completely removed (as an invalid character in XML text).

Regex-based solution

For simple xml-like content you could use re.split() to separate tags from the text and make the substitutions in non-tag text regions:

import re
from itertools import zip_longest  # named izip_longest on Python 2
from xml.sax.saxutils import escape  # '<' -> '&lt;'

# invalid_xml is the string defined in the lxml example above

# assumptions:
#   doc = *( start_tag / end_tag / text )
#   start_tag = '<' name *attr [ '/' ] '>'
#   end_tag = '<' '/' name '>'
ws = r'[ \t\r\n]*'  # allow ws between any token
name = '[a-zA-Z]+'  # note: expand if necessary but the stricter the better
attr = '{name} {ws} = {ws} "[^"]*"'  # note: fragile against missing '"'; no "'"
start_tag = '< {ws} {name} {ws} (?:{attr} {ws})* /? {ws} >'
end_tag = '{ws}'.join(['<', '/', '{name}', '>'])
tag = '{start_tag} | {end_tag}'

assert '{{' not in tag
while '{' in tag: # unwrap definitions
    tag = tag.format(**vars())

# the capturing group makes re.split() keep the tags in its result
tag_regex = re.compile('(%s)' % tag, flags=re.VERBOSE)

# escape &, <, > in the text
iters = [iter(tag_regex.split(invalid_xml))] * 2
pairs = zip_longest(*iters, fillvalue='')  # iterate 2 items at a time
print(''.join(escape(text) + tag for text, tag in pairs))

To avoid false positives for tags you could remove some of '{ws}' above.

Output

<field name="id">abcdef</field>
<field name="intro" > pqrst</field>
<field name="desc"> this is a test file. We will show 5&gt;2 and 3&lt;5 and
try to remove non xml compatible characters.</field>

Note: both < and > are escaped in the text.
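The pairing step relies on two behaviours: re.split() keeps the delimiters when the pattern contains a capturing group, and zipping the same iterator with itself yields consecutive (text, tag) pairs. A minimal sketch of both, using a deliberately loose tag pattern that is an assumption made only for this demo:

```python
import re
from itertools import zip_longest

# Hypothetical, deliberately loose tag pattern, used only for this demo.
parts = re.split(r'(</?[a-z]+>)', 'x<b>1<2</b>y')
print(parts)  # ['x', '<b>', '1<2', '</b>', 'y']

# The same iterator is listed twice, so zip_longest pulls two
# consecutive items per step: (text, tag), (text, tag), ...
iters = [iter(parts)] * 2
pairs = list(zip_longest(*iters, fillvalue=''))
print(pairs)  # [('x', '<b>'), ('1<2', '</b>'), ('y', '')]
```

Text always lands in the first slot and a tag in the second, which is why only the first element of each pair is escaped.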

You could call any function in place of escape(text) above, e.g.:

def escape4human(text):
    return text.replace('<', 'less than').replace('>', 'greater than')
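Put together, a self-contained sketch that produces the wording asked for in the question might look like the following. The TAG pattern is a simplified assumption (looser than the grammar above), and the names escape4human_spaced and fix are hypothetical; the replacement strings are padded with spaces so that "5>2" becomes "5 greater than 2".

```python
import re

# Simplified tag pattern (an assumption): <name attr="value" ...> or </name>
TAG = re.compile(r'(</?[a-zA-Z]+(?:\s+[a-zA-Z]+\s*=\s*"[^"]*")*\s*/?>)')

def escape4human_spaced(text):
    # Hypothetical variant of escape4human that pads the words with spaces.
    return text.replace('<', ' less than ').replace('>', ' greater than ')

def fix(doc):
    parts = TAG.split(doc)  # even indexes hold text, odd indexes hold tags
    return ''.join(part if i % 2 else escape4human_spaced(part)
                   for i, part in enumerate(parts))

sample = '<field name="desc"> We will show 5>2 and 3<5.</field>'
print(fix(sample))
# <field name="desc"> We will show 5 greater than 2 and 3 less than 5.</field>
```

Text that already contains spaces around the operator ("5 > 2") would end up with doubled spaces, so a real script may want to normalize whitespace afterwards.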

@adray: yes. < is invalid in xml text so the xml parser can't parse it properly and recover=True option allows the parser to skip it.

adray · Accepted Answer · 2012-11-10 10:45:04Z

2

Seems I did it for >:

re.sub('(?<! " )(?<! ")(?! )>','greater than', xml_string)

?<! is a negative lookbehind assertion,

?! is a negative lookahead assertion,

(...)(...) is a logical AND,

so the whole expression is meant to substitute every occurrence of '>' that is not preceded by ' " ', is not preceded by ' "', and is not followed by ' '.

The case for < is similar.
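To see how stacked lookarounds behave on the sample data, here is a small sketch. It is a variant, not the exact expression above: it assumes a tag-closing '>' is always preceded by '"' or '" ', and that a tag-opening '<' is always followed by '/' or a letter. A lookahead only tests what follows the point where it is written, so it has to come after the character it qualifies.

```python
import re

# Assumption: a '>' that closes a start tag is preceded by '"' or '" '.
GT = re.compile(r'(?<!")(?<!" )>')
# Assumption: a '<' that opens a tag is followed by '/' or a letter.
LT = re.compile(r'<(?![/A-Za-z])')

def replace_comparisons(text):
    text = GT.sub(' greater than ', text)
    return LT.sub(' less than ', text)

print(replace_comparisons('<field name="desc"> 5>2 and 3<5'))
# <field name="desc"> 5 greater than 2 and 3 less than 5

# Limitation: the '>' of a closing tag is not protected by these rules.
print(replace_comparisons('</field>'))
# </field greater than
```

The second call shows why lookarounds alone are fragile here: Python's re module needs fixed-width lookbehinds, so a closing tag of arbitrary length cannot be excluded this way.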

edited Nov 10, 2012 at 10:45

answered Nov 10, 2012 at 10:31

adray


Mr_Spock · Accepted Answer · 2012-11-10 04:03:22Z

-3

Use ElementTree for XML parsing.

answered Nov 10, 2012 at 4:03

Mr_Spock


3 Comments

rivu Over a year ago

The ElementTree throws an exception because the XML has supposedly misplaced > and < characters.

Fred Foo Over a year ago

@Mr_Spock: the point is that the XML is malformed, so ElementTree won't handle it. I just tried the lxml.html variant which can handle some malformed XML as well, but it too fails here.

Mr_Spock Over a year ago

I wonder why I received a downvote then. The guy's question wasn't even clear then. He didn't state what he was using until AFTER I brought up ElementTree. Seems a little unfair. I won't fret though.
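The exception mentioned in the comments above can be reproduced with the standard library alone. This is a minimal sketch; the <root> wrapper and the shortened field text are assumptions made to keep the example small.

```python
import xml.etree.ElementTree as ET

malformed = '<root><field name="desc">5>2 and 3<5</field></root>'

try:
    ET.fromstring(malformed)
    error = None
except ET.ParseError as exc:
    error = exc  # the parser rejects the bare '<' in "3<5"

print(type(error).__name__, error)

# Escaping the offending character (here by hand) makes the document parse.
fixed = malformed.replace('3<5', '3&lt;5')
print(ET.fromstring(fixed)[0].text)  # 5>2 and 3<5
```

A bare '>' in text is tolerated by the parser; it is the unescaped '<' that makes the input not well-formed.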
