3

What would be the best way to do this.

The input string is

<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust.</135_3></133_3>

the expected output is

{'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit \
using either camera now they are just sitting and collecting dust.':[133, 135],

'The other system worked for about 1 month': [116],

'on it then it started doing the same thing as the first one':[137]

}

that seems like a recursive regexp search but I can't figure out how exactly.

I can think of a tedious recursive function as of now, but have a feeling that there should be a better way.

Related question: Can regular expressions be used to match nested patterns?

6 Answers 6

4

Use expat or another XML parser; it's more explicit than anything else, considering you're dealing with XML data anyway.

However, note that XML element names can't start with a number as your example has them.

Here's a parser that will do what you need, although you'll need to tweak it to combine duplicate elements into one dict key:

from xml.parsers.expat import ParserCreate

open_elements = {}
result_dict = {}

def start_element(name, attrs):
    open_elements[name] = True

def end_element(name):
    del open_elements[name]

def char_data(data):
    for element in open_elements:
        cur = result_dict.setdefault(element, '')
        result_dict[element] = cur + data

if __name__ == '__main__':
    p = ParserCreate()

    p.StartElementHandler = start_element
    p.EndElementHandler = end_element
    p.CharacterDataHandler = char_data

    p.Parse(u'<_133_3><_135_3><_116_2>The other system worked for about 1 month</_116_2> got some good images <_137_3>on it then it started doing the same thing as the first one</_137_3> so then I quit using either camera now they are just sitting and collecting dust.</_135_3></_133_3>', 1)

    print result_dict
Sign up to request clarification or add additional context in comments.

Comments

4

Take an XML parser, make it generate a DOM (Document Object Model) and then build a recursive algorithm that traverses all the nodes, calls "text()" in each node (that should give you the text in the current node and all children) and puts that as a key in the dictionary.

Comments

2
from cStringIO   import StringIO
from collections import defaultdict
####from xml.etree   import cElementTree as etree
from lxml import etree

xml = "<e133_3><e135_3><e116_2>The other system worked for about 1 month</e116_2> got some good images <e137_3>on it then it started doing the same thing as the first one</e137_3> so then I quit using either camera now they are just sitting and collecting dust. </e135_3></e133_3>"

d = defaultdict(list)
for event, elem in etree.iterparse(StringIO(xml)):
    d[''.join(elem.itertext())].append(int(elem.tag[1:-2]))

print(dict(d.items()))

Output:

{'on it then it started doing the same thing as the first one': [137], 
'The other system worked for about 1 month': [116], 
'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
either camera now they are just sitting and collecting dust. ': [133, 135]}

2 Comments

Sebastian Did you test it? For me it fails with the following traceback Traceback (most recent call last): File "recurse_regex.py", line 55, in <module> d[''.join(elem.itertext())].append(int(elem.tag[1:-2])) AttributeError: 'etree._Element' object has no attribute 'itertext'
@JV: Yes, I've run it. 'xml.etree.cElementTree has no 'itertext', but 'lxml.etree' has. 'lxml' is not a part of Python's standard library, but 'cElementTree' is. I don't know what is better for OP: to install lxml 'easy_install lxml' or to write 'itertext' for cElementTree.
1

I think a grammar would be the best option here. I found a link with some information: http://www.onlamp.com/pub/a/python/2006/01/26/pyparsing.html

Comments

1

Note that you can't actually solve this by a regular expression, since they don't have the expressive power to enforce proper nesting.

Take the following mini-language:

A certain number of "(" followed by the same number of ")", no matter what the number.

You could make a regular expression very easily to represent a super-language of this mini-language (where you don't enforce the equality of the number of starts parentheses and end parentheses). You could also make a regular expression very easilty to represent any finite sub-language (where you limit yourself to some max depth of nesting). But you can never represent this exact language in a regular expression.

So you'd have to use a grammar, yes.

1 Comment

Regular expressions in real languages might be more powerful than regular expressions.
0

Here's an unreliable inefficient recursive regexp solution:

import re

re_tag = re.compile(r'<(?P<tag>[^>]+)>(?P<content>.*?)</(?P=tag)>', re.S)

def iterparse(text, tag=None):
    if tag is not None: yield tag, text
    for m in re_tag.finditer(text):
        for tag, text in iterparse(m.group('content'), m.group('tag')):
            yield tag, text

def strip_tags(content):
    nested = lambda m: re_tag.sub(nested, m.group('content'))
    return re_tag.sub(nested, content)


txt = "<133_3><135_3><116_2>The other system worked for about 1 month</116_2> got some good images <137_3>on it then it started doing the same thing as the first one</137_3> so then I quit using either camera now they are just sitting and collecting dust. </135_3></133_3>"
d = {}
for tag, text in iterparse(txt):
    d.setdefault(strip_tags(text), []).append(int(tag[:-2]))

print(d)

Output:

{'on it then it started doing the same thing as the first one': [137], 
 'The other system worked for about 1 month': [116], 
 'The other system worked for about 1 month got some good images on it then it started doing the same thing as the first one so then I quit using \
 either camera now they are just sitting and collecting dust. ': [133, 135]}

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.