Tokenization using regexp in Python

Question

I'm trying to tokenize a string like "spam bar ds<hai bye>sd baz eggs" into a list ['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs'], i.e. like str.split() but preserving whitespace inside < ... >.

My solution was to use re.split with (\S*<.*?>\S*)|\s+ pattern. However I get the following:

>>> re.split('(\S*<.*?>\S*)|\s+', "spam bar ds<hai bye>sd baz eggs")
['spam', None, 'bar', None, '', 'ds<hai bye>sd', '', None, 'baz', None, 'eggs']

I'm not sure where those Nones and empty strings are coming from. I can, of course, filter them out with a list comprehension [s for s in result if s], but I'm not comfortable doing that before I know the reason.

So, (1) why those Nones and empty strings, (2) could it be done better?

Cartroo · Accepted Answer · 2013-03-07 20:40:43Z

The None and empty string values are because you've used capturing brackets in your pattern, so the split is including matched text - see the official documentation for mention of this.
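As a minimal sketch of that behaviour (the short input strings and variable names are made up for illustration), the following compares re.split with and without a capturing group:

```python
import re

# No capturing group: only the pieces between delimiters are returned.
plain = re.split(r"\s+", "a b")

# Capturing group: the text matched by the group is returned as well.
captured = re.split(r"(\s+)", "a b")

# A group that took no part in a given match contributes None.
mixed = re.split(r"(x)|\s+", "a b")

# Two matches directly next to each other leave an empty string between them.
adjacent = re.split(r"(<.*?>)|\s+", "a <b c> d")

print(plain)     # ['a', 'b']
print(captured)  # ['a', ' ', 'b']
print(mixed)     # ['a', None, 'b']
print(adjacent)  # ['a', None, '', '<b c>', '', None, 'd']
```

The last line reproduces, on a smaller input, the mix of None and '' seen in the question.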

If you amend your pattern to r"((?:\S*<.*?>\S*)|\S+)" (i.e. marking the inner brackets with ?: to make them non-capturing and changing the whitespace match to a non-whitespace one) it should work, but only by keeping the delimiters, which you then need to filter out by skipping alternate items. I think you're better off with this:

import re

ITEM_RE = re.compile(r"(?:\S*<.*?>\S*)|\S+")
ITEM_RE.findall("spam bar ds<hai bye>sd baz eggs")

If you don't need an actual list (i.e. you only go through them one item at a time) then finditer() is more efficient as it only yields them one at a time. This is especially true if you're likely to bail out without going through the whole list.
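A rough sketch of the early bail-out case, assuming the same ITEM_RE as above; the helper name first_token_with_brackets is hypothetical and only serves the illustration:

```python
import re

ITEM_RE = re.compile(r"(?:\S*<.*?>\S*)|\S+")

def first_token_with_brackets(text):
    """Return the first token containing '<', or None if there is none."""
    # finditer() yields match objects lazily, so scanning stops as soon
    # as we return; the rest of the text is never tokenized.
    for match in ITEM_RE.finditer(text):
        token = match.group(0)
        if "<" in token:
            return token
    return None

print(first_token_with_brackets("spam bar ds<hai bye>sd baz eggs"))
# ds<hai bye>sd
```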

It might also be possible in principle with a negative lookbehind assertion, but in practice I don't think it's possible to create one flexible enough - I tried r"(?<!<[^>]*)\s+" and got the error "look-behind requires fixed-width pattern", so I guess that's a no-no. The docs corroborate this - lookbehind assertions (both positive and negative) all need to be fixed width.
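The restriction can be checked directly. This is a small sketch against the standard re module; the helper name compiles and the fixed-width comparison pattern are illustrative additions:

```python
import re

def compiles(pattern):
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

# The lookbehind body [^>]* can match any number of characters,
# so re rejects the pattern at compile time.
variable_width = r"(?<!<[^>]*)\s+"

# A lookbehind of exactly one character is accepted, but it only
# looks at the single character just before the whitespace.
fixed_width = r"(?<!<)\s+"

print(compiles(variable_width))  # False
print(compiles(fixed_width))     # True
```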

The issue with this approach is going to be if you expect nested angle brackets - then you're going to not get what you expect. For example, parsing ds<hai <bye> foo>sd will yield ds<hai <bye> as one token. I think this is the class of problem that regular expressions can't address - you need something closer to a proper parser. It wouldn't be hard to write one in pure Python which goes through character at a time and counts nesting levels of brackets, but that'll be quite slow. Depends whether you can be sure you'll only see one level of nesting in your input.
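For illustration, here is one way such a character-at-a-time tokenizer might look. It is a sketch under assumptions not stated in the answer: a stray > outside any bracket is kept as an ordinary character, and an unclosed < swallows the rest of the input into one token.

```python
def tokenize(text):
    """Split on whitespace, except inside (possibly nested) <...>."""
    tokens = []
    current = []   # characters of the token being built
    depth = 0      # current nesting level of angle brackets
    for ch in text:
        if ch == "<":
            depth += 1
            current.append(ch)
        elif ch == ">":
            # Assumption: an unmatched '>' is treated as a plain character.
            if depth > 0:
                depth -= 1
            current.append(ch)
        elif ch.isspace() and depth == 0:
            # Whitespace outside brackets ends the current token, if any.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize("spam bar ds<hai bye>sd baz eggs"))
# ['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs']
print(tokenize("ds<hai <bye> foo>sd"))
# ['ds<hai <bye> foo>sd']
```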

eyquem · Accepted Answer · 2013-03-08 13:45:06Z


I got this regex:

import re

ss = "spam bar ds<hai bye>sd baz eggs ZQ<boo <abv> foo>WX  "

reg = re.compile(r'(?:'
                     r'\S*?'                  # optional text before the bracket
                     r'<'
                     r'[^<>]*?'
                     r'(?:<[^<>]*>[^<>]*)*'   # inner <...> groups, one level deep
                     r'[^<>]*?'
                     r'>'
                       r')?'
                 r'\S+')

print(reg.findall(ss))

result

['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs',
 'ZQ<boo <abv> foo>WX']

EDIT 1

A new regex, more accurate, after Cartroo's comment:

import re

pat = ('(?<!\S)'  # absence of non-whitespace before

       '(?:'
           '[^\s<>]+'

           '|'  # OR

           '(?:[^\s<>]*)'
           '(?:'
               '<'
               '[^<>]*?'
               '(?:<[^<>]*?>[^<>]*)*'
               '[^<>]*?'
               '>'
               ')'
           '(?:[^\s<>]*)'
       ')'

       '(?!\S)' # absence of non-whitespace after)
       )
reg = re.compile(pat)

ss = ("spam i>j bar ds<hai bye>sd baz eggs Z<boo <abv>"
      " foo>W ttt <two<;>*<:> three> ")
print('%s\n' % ss)
print(reg.findall(ss))

ss = ("a<b<E1>c>d <b<E2>c>d <b<E3>c> a<<E4>c>d <<E5>>d "
      "<<E6>> <<>>")
print('\n\n%s\n' % ss)
print(reg.findall(ss))

result

spam i>j bar ds<hai bye>sd baz eggs Z<boo <abv> foo>W ttt <two<;>*<:> three> 

['spam', 'bar', 'ds<hai bye>sd', 'baz', 'eggs', 
 'Z<boo <abv> foo>W', 'ttt', '<two<;>*<:> three>']


a<b<E1>c>d <b<E2>c>d <b<E3>c> a<<E4>c>d <<E5>>d <<E6>> <<>>

['a<b<E1>c>d', '<b<E2>c>d', '<b<E3>c>', 'a<<E4>c>d', '<<E5>>d',
 '<<E6>>', '<<>>']

The above strings were well formed and the results are consistent.
On a non-well-formed text (regarding the brackets), it may give non-desired results:

ss = """A<B<C>D  
 E<F<G>H 
I<J>K> 
 L<<M>N
   O<P>>Q
 R<<S>    T<<>"""
print('\n\n%s\n' % ss)
print(reg.findall(ss))

result

A<B<C>D  
 E<F<G>H 
I<J>K> 
 L<<M>N
   O<P>>Q
 R<<S>    T<<>

['E<F<G>H \nI<J>K>', 'L<<M>N\n   O<P>>Q']

That's because of the star at the end of '(?:<[^<>]*?>[^<>]*)*', which allows several inner <...> groups in a row. This behavior can be turned off by removing the star. It is what makes regexes difficult to use for analyzing such "convoluted" texts, to borrow Cartroo's word.

.

EDIT 2

When I said that the results 'E<F<G>H \nI<J>K>' and 'L<<M>N\n O<P>>Q' are undesired, I didn't mean that the matched portions fail to respect the regex pattern as I crafted it (how could they?); the matched portions are indeed well formed:
two portions <G> and <J> lie between two brackets: < <G> <J> >
two portions <M> and <P> lie between two brackets: < <M> <P> >

In fact, my remark rested on the implicit assumption that each matched portion should lie on a single line. But as soon as an implicit assumption is made explicit, a possible solution emerges.
If matched portions extending over several lines are not desired, it's easy to tell the regex not to match them, contrary to what I wrote. It suffices to add the character \n at some places in the regex's pattern.

In fact, it means that the matched portions must not cross a \n character, so this character can be considered a separator of the matched portions. Likewise, any other character can be chosen as a separator between matched portions on the same line, for example # in the following code.

.

Regexes can't cook or fetch the kids from school, but they are extremely powerful. Saying that the behavior of a regex on malformed text is an issue is too brief: one must add that it's an issue with the text, not with the regex. A regex does what it is ordered to do: eat any text that is given to it. And it eats voraciously, that is, without checking the text for conformity; such checking is not part of its intended behaviour, so it isn't responsible if it is fed an unhealthy text. Saying that the behavior of a regex on malformed text is an issue sounds like reproaching a kid for sometimes being fed whisky and peppered food.

It's the responsibility of the coder to ensure that the text passed to a regex is well formed, in the same way that a coder puts a verification snippet in a program to ensure that the inputs are integers so that the program runs correctly.
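Such a verification step could be sketched as follows. The function name brackets_balanced and the definition of "well formed" used here (every < has a matching later >, and no > appears before its <) are assumptions made for the example, not something specified in the answer:

```python
def brackets_balanced(text):
    """Return True if every '<' is closed by a later '>' and vice versa."""
    depth = 0
    for ch in text:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
            if depth < 0:
                # A '>' appeared with no '<' left to close.
                return False
    # Any remaining depth means at least one '<' was never closed.
    return depth == 0

print(brackets_balanced("a<b<E1>c>d <<E6>> <<>>"))  # True
print(brackets_balanced("A<B<C>D"))                 # False
```

A caller might run this check first and only hand the text to the regex when it returns True.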

.

This point is different from the misuse of regexes when one tries to parse a marked-up text such as XML. Regexes are unable to parse such a text, OK, because it's impossible to craft a regex that will react correctly to a malformed marked-up text. It's also the responsibility of the coder not to try to do that.
That doesn't mean that regexes must not be employed to analyze a marked-up text if this text has been validated.
Anyway, even a parser will not catch data if a text is too malformed.

I mean that we must distinguish:

the nature of the text passed to a regex (malformed / well formed)
the nature of the pursued aim when using a regex (parsing / analyzing)

.

import re

ss = """
 A<:<11>:<12>:>
 fgh
 A<#:<33>:<34>:>
 A#<:<55>:<56>:>
 A<:<77>:<78> i<j>
 A<B<C>D #
 E<F<G>H #
 I<J>K> 
 L<<M>N 
 O<P>>Q  #
 R<<S>  T<<>"""
print('%s\n' % ss)

pat = ('(?<!\S)'  # absence of non-whitespace before
           '(?:[^\s<>]*)'
           '(?:<'
               '[^<>]*?'
               '(?:<[^<>]*?>[^<>]*)*'
               '>)'
           '(?:[^\s<>]*)'
       '(?!\S)' # absence of non-whitespace after)
       )
reg = re.compile(pat)
print('------------------------------')
print('\n'.join(map(repr, reg.findall(ss))))


pat = ('(?<!\S)'  # absence of non-whitespace before
           '(?:[^\s<>]*)'
           '(?:<'
               '[^<>\n]*?'
               '(?:<[^<>\n]*?>[^<>\n]*)*'
               '>)'
           '(?:[^\s<>]*)'
       '(?!\S)' # absence of non-whitespace after)
       )
reg = re.compile(pat)
print('\n----------- with \\n -------------')
print('\n'.join(map(repr, reg.findall(ss))))


pat = ('(?<!\S)'  # absence of non-whitespace before
           '(?:[^\s<>]*)'
           '(?:<'
               '[^<>#]*?'
               '(?:<[^<>#]*?>[^<>#]*)*'
               '>)'
           '(?:[^\s<>]*)'
       '(?!\S)' # absence of non-whitespace after)
       )
reg = re.compile(pat)
print('\n------------- with # -----------')
print('\n'.join(map(repr, reg.findall(ss))))


pat = ('(?<!\S)'  # absence of non-whitespace before
           '(?:[^\s<>#]*)'
           '(?:<'
               '[^<>#]*?'
               '(?:<[^<>#]*?>[^<>#]*)*'
               '>)'
           '(?:[^\s<>]*)'
       '(?!\S)' # absence of non-whitespace after)
       )
reg = re.compile(pat)
print('\n------ with ^# everywhere -------')
print('\n'.join(map(repr, reg.findall(ss))))

result

 A<:<11>:<12>:>
 fgh
 A<#:<33>:<34>:>
 A#<:<55>:<56>:>
 A<:<77>:<78> i<j>
 A<B<C>D #
 E<F<G>H #
 I<J>K> 
 L<<M>N 
 O<P>>Q  #
 R<<S>  T<<>

------------------------------
'A<:<11>:<12>:>'
'A<#:<33>:<34>:>'
'A#<:<55>:<56>:>'
'i<j>'
'E<F<G>H #\n I<J>K>'
'L<<M>N \n O<P>>Q'

----------- with \n -------------
'A<:<11>:<12>:>'
'A<#:<33>:<34>:>'
'A#<:<55>:<56>:>'
'i<j>'

------------- with # -----------
'A<:<11>:<12>:>'
'A#<:<55>:<56>:>'
'i<j>'
'L<<M>N \n O<P>>Q'

------ with ^# everywhere -------
'A<:<11>:<12>:>'
'i<j>'
'L<<M>N \n O<P>>Q'

edited Mar 8, 2013 at 13:45

answered Mar 7, 2013 at 22:12


7 Comments

Cartroo Over a year ago

Except that this pattern requires at least one character outside the angle brackets. For example, try it on "one <two three> four" and it won't do what you expect. You could perhaps fix this by changing the final \S+ to \S* and adding an additional alternation to cover the \S+ case (i.e. no angle brackets) - this works because if there's an angle bracket then you don't actually need any other characters to make it a valid token (I assume). I haven't tested that, however. Either way, this convolution suggests that regexps aren't the optimal solution here.
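The point can be tried out as follows. The first pattern is the answer's first regex written on one line; the amended pattern is only one possible reading of the fix suggested in this comment (an assumption, since no such pattern was posted):

```python
import re

# First regex from the answer: the trailing \S+ demands at least one
# non-whitespace character after the optional bracketed part.
first = re.compile(r"(?:\S*?<[^<>]*?(?:<[^<>]*>[^<>]*)*[^<>]*?>)?\S+")

# Hypothetical amended version: \S* after the bracket, plus a separate
# \S+ alternative for tokens without angle brackets.
amended = re.compile(r"\S*?<[^<>]*?(?:<[^<>]*>[^<>]*)*[^<>]*?>\S*|\S+")

sample = "one <two three> four"
print(first.findall(sample))    # ['one', '<two', 'three>', 'four']
print(amended.findall(sample))  # ['one', '<two three>', 'four']
```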

Alan Moore Over a year ago

+1 for the second regex, but why are you trying to accommodate nested brackets? I don't see that requirement in the question or in any of the comments.

eyquem Over a year ago

@AlanMoore It isn't to answer to a need expressed by OP, it's for fun, personal challenge, and to show to Cartroo how regexes can do that. He says in his answer : "The issue with this approach is going to be if you expect nested angle brackets - then you're going to not get what you expect (...) I think this is the class of problem that regular expressions can't address - you need something closer to a proper parser" If OP doesn't want this feature, he will remove parts of the regex

Cartroo Over a year ago

You've certainly achieved more than I could with regexps, but the behaviour on malformed strings is an issue. Also, the minimal matching introduces a lot of backtracking which will likely make the regexp solution considerably less efficient than a traditional state-based lexer. Hats off for the attempt, nonetheless.

Alan Moore Over a year ago

@Cartroo: That's a non-issue; you can write inefficient regexes just as easily with greedy quantifiers as you can with non-greedy ones. If he had been using .*? all over the place I might have agreed with you, not because he used reluctant quantifiers, but because he was being sloppy. In actual fact, these reluctant quantifiers are having no effect at all; [^<>]* was going to stop before the next < or > anyway. (However, the last [^<>]*? is redundant; that's already covered in the previous atom: (?:<[^<>]*?>[^<>]*)*.)
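A quick way to check this claim is to compare the EDIT 1 pattern with a version that uses greedy quantifiers and drops the redundant atom. The simplified pattern below is an illustrative rewrite, not one posted in the thread, and the comparison only covers the sample strings taken from the answer:

```python
import re

# Pattern from EDIT 1, written on fewer lines.
original = re.compile(
    r"(?<!\S)(?:[^\s<>]+|(?:[^\s<>]*)"
    r"(?:<[^<>]*?(?:<[^<>]*?>[^<>]*)*[^<>]*?>)"
    r"(?:[^\s<>]*))(?!\S)"
)

# Greedy quantifiers, and without the last [^<>]*? before the closing '>'.
simplified = re.compile(
    r"(?<!\S)(?:[^\s<>]+|[^\s<>]*<[^<>]*(?:<[^<>]*>[^<>]*)*>[^\s<>]*)(?!\S)"
)

samples = [
    "spam i>j bar ds<hai bye>sd baz eggs Z<boo <abv> foo>W ttt <two<;>*<:> three> ",
    "a<b<E1>c>d <b<E2>c>d <b<E3>c> a<<E4>c>d <<E5>>d <<E6>> <<>>",
    "A<B<C>D  \n E<F<G>H \nI<J>K> \n L<<M>N\n   O<P>>Q\n R<<S>    T<<>",
]

# True if both patterns find exactly the same tokens in every sample.
agree = all(original.findall(s) == simplified.findall(s) for s in samples)
print(agree)  # True
```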


neilr8133 · Accepted Answer · 2013-03-07 20:24:29Z


I believe the None values are due to the presence of ()s in the pattern based on this line from the documentation:

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list

Using the Regex Tester on your input may also help visualize the parsing: http://regexpal.com/?flags=g&regex=%28\S*%3C.*%3F%3E\S*%29|\s%2B&input=spam%20bar%20ds%3Chai%20bye%3Esd%20baz%20eggs

answered Mar 7, 2013 at 20:24

