I came up with this regex:
import re
ss = "spam bar ds<hai bye>sd baz eggs ZQ<boo <abv> foo>WX "
reg = re.compile('(?:'
'\S*?'
'<'
'[^<>]*?'
'(?:<[^<>]*>[^<>]*)*'
'[^<>]*?'
'>'
')?'
'\S+')
print reg.findall(ss)
result
['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs',
'ZQ<boo <abv> foo>WX']
EDIT 1
A new regex, more accurate, after Cartroo's comment:
import re
pat = ('(?<!\S)' # absence of non-whitespace before
'(?:'
'[^\s<>]+'
'|' # OR
'(?:[^\s<>]*)'
'(?:'
'<'
'[^<>]*?'
'(?:<[^<>]*?>[^<>]*)*'
'[^<>]*?'
'>'
')'
'(?:[^\s<>]*)'
')'
'(?!\S)' # absence of non-whitespace after
)
reg = re.compile(pat)
ss = ("spam i>j bar ds<hai bye>sd baz eggs Z<boo <abv>"
" foo>W ttt <two<;>*<:> three> ")
print '%s\n' % ss
print reg.findall(ss)
ss = ("a<b<E1>c>d <b<E2>c>d <b<E3>c> a<<E4>c>d <<E5>>d "
"<<E6>> <<>>")
print '\n\n%s\n' % ss
print reg.findall(ss)
result
spam i>j bar ds<hai bye>sd baz eggs Z<boo <abv> foo>W ttt <two<;>*<:> three>
['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs',
'Z<boo <abv> foo>W', 'ttt', '<two<;>*<:> three>']
a<b<E1>c>d <b<E2>c>d <b<E3>c> a<<E4>c>d <<E5>>d <<E6>> <<>>
['a<b<E1>c>d', '<b<E2>c>d', '<b<E3>c>', 'a<<E4>c>d', '<<E5>>d',
'<<E6>>', '<<>>']
The strings above are well formed, and the results are consistent.
On text whose brackets are not well formed, the regex may give undesired results:
ss = """A<B<C>D
E<F<G>H
I<J>K>
L<<M>N
O<P>>Q
R<<S> T<<>"""
print '\n\n%s\n' % ss
print reg.findall(ss)
result
A<B<C>D
E<F<G>H
I<J>K>
L<<M>N
O<P>>Q
R<<S> T<<>
['E<F<G>H \nI<J>K>', 'L<<M>N\n O<P>>Q']
That's because of the star at the end of '(?:<[^<>]*?>[^<>]*)*': it lets the nested group repeat, so a single match can swallow several inner <...> pairs, even across lines. This behavior can be turned off by removing the star. It is what makes regexes difficult to use for analyzing such "convoluted" texts, as Cartroo calls them.
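To make the effect of the star concrete, here is a minimal sketch that builds the EDIT 1 pattern with a configurable quantifier on the nested group. The helper name build is hypothetical. Note that removing the star outright would make the nested group mandatory, so this sketch assumes the intent is "at most one nested pair" and uses ? instead; that is my reading, not something stated above.

```python
import re

def build(quantifier):
    # Same pattern as in EDIT 1; only the quantifier applied to the
    # nested <...> group changes ('*' originally, '?' as an assumption).
    return re.compile(
        r'(?<!\S)'                      # absence of non-whitespace before
        r'(?:'
        r'[^\s<>]+'
        r'|'                            # OR
        r'[^\s<>]*'
        r'<[^<>]*?'
        r'(?:<[^<>]*?>[^<>]*)' + quantifier +
        r'[^<>]*?>'
        r'[^\s<>]*'
        r')'
        r'(?!\S)'                       # absence of non-whitespace after
    )

many = build('*')          # original behavior
at_most_one = build('?')   # assumed reading of "removing the star"

malformed = "E<F<G>H\nI<J>K>"
with_star = many.findall(malformed)            # one match spanning both lines
without_star = at_most_one.findall(malformed)  # no match at all
```

The price of this variant is that well-formed input with two nested pairs, such as '<two<;>*<:> three>', is no longer matched either.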
EDIT 2
When I said that the results 'E<F<G>H \nI<J>K>' and 'L<<M>N\n O<P>>Q' are undesired, I didn't mean that the matched portions fail to respect the regex's pattern as I crafted it (how could they?); the matched portions are indeed well formed:
two portions <G> and <J> are between two brackets < <G> <J> >
two portions <M> and <P> are between two brackets < <M> <P> >
In fact, there was an implicit assumption that each matched portion should lie on a single line. But as soon as an implicit assumption is made explicit, a possible solution emerges.
If matches spanning several lines are not desired, it's easy to tell the regex not to match them, contrary to what I wrote. It suffices to add the character \n to the negated character classes at certain places in the regex's pattern.
In effect, this means that a match must not pass over a \n character, so this character can be regarded as a separator between matches. Hence any other character can be chosen as a separator between matches on the same line, for example # in the following code.
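The idea of "adding a separator character to the pattern" can be sketched as a small builder. The name make_pattern and its excluded parameter are hypothetical, introduced only for illustration; the sketch assumes the separator is excluded only from the character classes inside the brackets, not from the prefix or suffix around them.

```python
import re

def make_pattern(excluded=''):
    """Build the bracket regex; characters in `excluded` act as separators
    because they are forbidden inside the <...> part of a match."""
    inner = '[^<>%s]' % re.escape(excluded)   # e.g. [^<>\n] or [^<>\#]
    return re.compile(
        r'(?<!\S)'                            # absence of non-whitespace before
        r'[^\s<>]*'
        r'<' + inner + r'*?'
        r'(?:<' + inner + r'*?>' + inner + r'*)*'
        r'>'
        r'[^\s<>]*'
        r'(?!\S)'                             # absence of non-whitespace after
    )

text = "a<b<1>c\nd<2>e> x<y>"
across_lines = make_pattern().findall(text)       # the first match crosses the newline
single_line = make_pattern('\n').findall(text)    # the newline now acts as a separator
```

Passing '#' instead of '\n' makes # the separator in the same way.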
Regexes can't cook or fetch the kids from school, but they are extremely powerful. Saying that the behavior of a regex on malformed text is an issue is too short: one must add that it's an issue with the text, not with the regex. A regex does what it is ordered to do: eat any text that is given to it. And it eats voraciously, that is, without verifying the text's conformity, which is not something it is intended to do, so it isn't responsible if it is fed an unhealthy text. Saying that the behavior of a regex on malformed text is an issue sounds like reproaching a kid for sometimes being fed whisky and peppered food.
It's the coder's responsibility to ensure that the text passed to a regex is well formed, in the same way that a coder puts a verification snippet in a program to ensure that the inputs are integers so that the program runs correctly.
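As an illustration of such a verification snippet, here is a minimal sketch of a pre-check that could run before the regex. The names brackets_balanced, well_formed_lines and max_depth are hypothetical, not part of the code above. The default depth limit of 2 is an assumption reflecting that the pattern only handles one level of nesting, and the line-by-line filtering assumes each match is meant to lie on one line.

```python
def brackets_balanced(text, max_depth=2):
    """Return True if every '<' is closed by a '>' and nesting stays within max_depth."""
    depth = 0
    for ch in text:
        if ch == '<':
            depth += 1
            if depth > max_depth:
                return False      # nested deeper than the regex can handle
        elif ch == '>':
            depth -= 1
            if depth < 0:
                return False      # a '>' with no '<' before it
    return depth == 0             # False if some '<' was never closed


def well_formed_lines(text):
    """Keep only the lines whose brackets are balanced."""
    return [line for line in text.splitlines() if brackets_balanced(line)]


kept = well_formed_lines("A<B<C>D\nE<F<G>H>\nO<P>>Q")
```

With a check like this, lines such as A<B<C>D or O<P>>Q would be rejected before the regex ever sees them.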
This point is different from the misuse of regexes when one tries to parse marked-up text such as XML. Regexes are unable to parse such text, OK, because it's impossible to craft a regex that will react correctly to malformed marked-up text. It's also the coder's responsibility not to try to do that.
That doesn't mean that regexes must not be used to analyze marked-up text if this text has been validated.
Anyway, even a parser will not extract data if the text is too malformed.
I mean that we must distinguish:
import re
ss = """
A<:<11>:<12>:>
fgh
A<#:<33>:<34>:>
A#<:<55>:<56>:>
A<:<77>:<78> i<j>
A<B<C>D #
E<F<G>H #
I<J>K>
L<<M>N
O<P>>Q #
R<<S> T<<>"""
print '%s\n' % ss
pat = ('(?<!\S)' # absence of non-whitespace before
'(?:[^\s<>]*)'
'(?:<'
'[^<>]*?'
'(?:<[^<>]*?>[^<>]*)*'
'>)'
'(?:[^\s<>]*)'
'(?!\S)' # absence of non-whitespace after
)
reg = re.compile(pat)
print '------------------------------'
print '\n'.join(map(repr,reg.findall(ss)))
pat = ('(?<!\S)' # absence of non-whitespace before
'(?:[^\s<>]*)'
'(?:<'
'[^<>\n]*?'
'(?:<[^<>\n]*?>[^<>\n]*)*'
'>)'
'(?:[^\s<>]*)'
'(?!\S)' # absence of non-whitespace after
)
reg = re.compile(pat)
print '\n----------- with \\n -------------'
print '\n'.join(map(repr,reg.findall(ss)))
pat = ('(?<!\S)' # absence of non-whitespace before
'(?:[^\s<>]*)'
'(?:<'
'[^<>#]*?'
'(?:<[^<>#]*?>[^<>#]*)*'
'>)'
'(?:[^\s<>]*)'
'(?!\S)' # absence of non-whitespace after
)
reg = re.compile(pat)
print '\n------------- with # -----------'
print '\n'.join(map(repr,reg.findall(ss)))
pat = ('(?<!\S)' # absence of non-whitespace before
'(?:[^\s<>#]*)'
'(?:<'
'[^<>#]*?'
'(?:<[^<>#]*?>[^<>#]*)*'
'>)'
'(?:[^\s<>]*)'
'(?!\S)' # absence of non-whitespace after
)
reg = re.compile(pat)
print '\n------ with ^# everywhere -------'
print '\n'.join(map(repr,reg.findall(ss)))
result
A<:<11>:<12>:>
fgh
A<#:<33>:<34>:>
A#<:<55>:<56>:>
A<:<77>:<78> i<j>
A<B<C>D #
E<F<G>H #
I<J>K>
L<<M>N
O<P>>Q #
R<<S> T<<>
------------------------------
'A<:<11>:<12>:>'
'A<#:<33>:<34>:>'
'A#<:<55>:<56>:>'
'i<j>'
'E<F<G>H #\n I<J>K>'
'L<<M>N \n O<P>>Q'
----------- with \n -------------
'A<:<11>:<12>:>'
'A<#:<33>:<34>:>'
'A#<:<55>:<56>:>'
'i<j>'
------------- with # -----------
'A<:<11>:<12>:>'
'A#<:<55>:<56>:>'
'i<j>'
'L<<M>N \n O<P>>Q'
------ with ^# everywhere -------
'A<:<11>:<12>:>'
'i<j>'
'L<<M>N \n O<P>>Q'