5

I've been sitting on this for 2 days now, and still didn't find an effective way to do so. Lets say we have the string:

<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>

The output I want:

<562947621914222412421>

Well, recursively there might be data inside <> brackets, it could consist of numbers and letters - but the first level consists only of numbers. I bolded out the data I want to extract.

I want to do this in Python. The naive way is to implement a bracket stack of course (that way I'll know if I'm inside an inner bracket or I'm in the 1st level) - but it's very inefficient going char by char. I believe there is a good regex pattern I could use, but I haven't come up with something that works.

Can some with enough experience with regex give a little help?

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

1
  • 1
    How deeply is your real data nested? Commented Dec 21, 2019 at 17:08

4 Answers 4

5

Of course, other ideas except running char by char iteratively are welcome as well, run-time is important to me.

Of course, any regex also has to run through the string character by character, too. Don't rule out the "naive" solution so easily: it turns out the simple way is more efficient than all three of the posted answers so far.


Here's a solution like your "naive" one: but it doesn't require a stack, because there is only one kind of open-bracket. Even with multiple kinds of brackets, you only need a stack if you also want to detect when the brackets are mismatched.

def chars_at_level(s):
    out = ['<']
    nesting_level = 0

    for c in s:
        if c == '<':
            nesting_level += 1
        elif c == '>':
            nesting_level -= 1
        elif nesting_level == 1:
            out.append(c)

    out.append('>')
    return ''.join(out)

Example:

>>> s = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> chars_at_level(s)
'<562947621914222412421>'

Now for the performance comparison. It beats the other three solutions, though Seb's solution is close.

>>> timeit(lambda: chars_at_level(s))
7.594452977000401
>>> timeit(lambda: parse(s)) # Seb's solution using re.sub
7.817124693000096
>>> timeit(lambda: regex_sub(s)) # bobble bubble's recursive regex
9.322779934999744
>>> timeit(lambda: nested_list(s)) # Ajax1234's nested list solution
17.795835303999866

However, Seb's solution is much less efficient in the worst case, on strings like <<<<<<1>>>>>>, because it does O(n) replacements on strings of length O(n), giving a running time of O(n²). The other two posted solutions still seem to be about O(n) on this kind of string, though I had to increase the system recursion limit for Ajax1234's solution to work. The "naive" solution is still fastest.

>>> t = (1000 * '<') + '1' + (1000 * '>')
>>> timeit(lambda: chars_at_level(t), number=1000)
0.1329130509998322
>>> timeit(lambda: parse(t), number=1000) # Seb's solution using re.sub
31.281542531000014
>>> timeit(lambda: regex_sub(t), number=1000) # bobble bubble's recursive regex
0.705901896999876
>>> timeit(lambda: nested_list(t), number=1000) # Ajax1234's nested list solution
1.1296931150000091

By the way, even if you do want to augment the "naive" solution with a stack, it still only takes O(n) time. It's also fairly trivial to change this algorithm to get the characters at any other nesting level, too.

Sign up to request clarification or add additional context in comments.

Comments

3

This is one way; it recursively removes inner complete tags <[^<>]*> until only the outer level elements remain:

def parse(string):
    while True:
        output = re.sub(r'(?<!^)<([^<>]*)>(?!$)', '', string)
        if output == string:
            break
        string = output
    return output
>>> string = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
>>> parse(string)
'<562947621914222412421>'
>>> %timeit parse(string)
6.57 µs ± 99.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

There should also be a way to do the recursion in regex itself, but I couldn't quite make it work. The built-in module re does not support this, but regex does. Here are some ideas on that.

Comments

3

There is an alternative regex module available in Python that allows recursive patterns.

With this you could use such pattern for balanced brackets and replace with empty string.

regex.sub(r'(?!^)<(?:[^><]*|(?R))+>', '', s)

See this regex101 demo or a Python demo, results in

<562947621914222412421>

At (?R) the pattern is pasted from start, same like (?0). Used (?!^) to omit first <

Comments

2

You can use recursion to create a nested list from the structure. Then, to produce the desired result, you can simply find all the top level strings in the list:

import re
data = '<5629476219<421fsaas42f>14222<2412f2<2421savsar21>12vsaf21>412<<<142avsa1>1a24>421>421>'
def get_data(d):
  if (val:=next(d, None)) not in {'<', '>'}:
     yield val
  if val == '<':
     yield list(get_data(d))
  if val is not None and val != '>':
     yield from get_data(d)

result = '<'+''.join(i for i in list(get_data(iter(re.findall('[^\<\>]+|[\<\>]', data))))[0] if isinstance(i, str))+'>'

Output:

'<562947621914222412421>'

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.