
I'm asking a follow-up question about tokenizing a string. It's almost working, but I'm missing one edge case.

Right now my function is:

def tokenize(text):
    return re.findall('[\\!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]|\w+', text)

It almost does what I want, except for this input:

>>> tokenize('Break/\\ is almost? ? soon')
Output: ['Break', '/','is', 'almost', '?', '?', 'soon']

Expected Output:
['Break', '/', '\\', 'is', 'almost', '?', '?', 'soon']

I guess it's something to do with escaping, but I thought I had matched the backslash in my regex. Any suggestions?

  • I must be missing something... what is the difference between the produced output and your expected output? I'm apparently no good at the "Spot the difference" game today. Commented Nov 25, 2014 at 19:56
  • If you're having problems with escaping, your life will be a whole lot easier if you use raw string literals. That's what they're for. Commented Nov 25, 2014 at 19:57
  • Also, I can see at least one missing escape: \w should be \\w. But you happen to get away with that one, because (at least in 2.7 and 3.4) \w isn't a backslash escape sequence, so that can't be your problem. Commented Nov 25, 2014 at 19:57
  • Sorry, I had messed up the output. I've updated it now. What do you mean by using raw string literals? Commented Nov 25, 2014 at 19:57
  • @user3750474: If you prefix a string literal with r, it leaves all backslashes between the quotes alone, so you can write r'[\!"#$…' and trust that the \ will get through to the regex parser instead of being interpreted by Python itself. Commented Nov 25, 2014 at 19:59
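
To make the raw-literal comment above concrete, here is a minimal sketch of how Python reads the same characters with and without the r prefix. The variable names are only for illustration and do not come from the question.

```python
# A regular literal processes backslash escapes; a raw literal does not.
plain = '\\!'    # Python collapses \\ into one backslash: 2 characters
raw = r'\\!'     # both backslashes survive: 3 characters

print(len(plain), len(raw))   # 2 3

# The same regex fragment spelled both ways yields the same string.
print('\\w+' == r'\w+')       # True
```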

2 Answers


Your problem is that the only backslashes inside your character classes are being interpreted as escape characters. The \\! is parsed by Python into \!, and then by the regexp engine into an escaped !. Likewise, the \\] is parsed by Python into \], and then by the regexp engine into an escaped ]. So, there's nothing to match a backslash.
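
One way to check this is to print the pattern after Python has processed the literal. The sketch below uses the same pattern as the question, with one assumption: the final \w is written as \\w, which spells the same two characters but avoids an invalid-escape warning on newer Python versions.

```python
import re

# Same pattern as in the question, as a regular (non-raw) literal.
# Assumption: '\\w' replaces '\w'; both produce a backslash followed by w.
pattern = '[\\!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]|\\w+'

# This is the text the regexp engine actually receives.
print(pattern)

# Inside the class, \! and \] are escapes for ! and ], so no literal
# backslash is a member of the class and a lone backslash never matches.
print(re.findall(pattern, '\\'))   # []
```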

You could double-escape the first backslashes, so the \\\\! will get parsed by Python into \\! and then by the regexp engine into a \ followed by a !. Of course you'd leave the \\] alone, because you want that to be parsed as an escaped ]. And you'd want to escape the backslash before w as well; you happen to get away with that one because Python (at least as of 2.7 and 3.4) doesn't have a \w escape sequence, but it's not a good idea to count on that.
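
As a rough sketch of that double-escaped approach, assuming the same input string as in the question, it might look like this:

```python
import re

# Non-raw literal: '\\\\' reaches the regexp engine as \\, which inside a
# character class stands for one literal backslash. '\\]' still arrives
# as \], an escaped ], and '\\w' arrives as \w.
pattern = '[\\\\!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~]|\\w+'

tokens = re.findall(pattern, 'Break/\\ is almost? ? soon')
print(tokens)   # ['Break', '/', '\\', 'is', 'almost', '?', '?', 'soon']
```

It works, but counting four backslashes to match one is easy to get wrong.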

But really, your life will be a lot easier if you use raw string literals, to prevent Python from interpreting any backslashes, so you know they all get to the regexp engine. This is explained in the Regular Expression HOWTO.

re.findall(r'[\\!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~]|\w+', text)

Now the \\! is not touched by Python, so the regexp engine interprets it as a literal \ followed by a !. Also note that I've replaced the double backslash before ] with a single one: we don't want that backslash to be escaped, we want it to escape the ].
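
Put back into the function from the question, the raw-string version could look like the sketch below. The precompiled TOKEN_RE name is my own addition for illustration; the pattern is the one shown above.

```python
import re

# Hypothetical module-level name; compiling once is optional.
# The raw literal passes every backslash straight to the regexp engine.
TOKEN_RE = re.compile(r'[\\!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~]|\w+')

def tokenize(text):
    """Return punctuation characters and runs of word characters, in order."""
    return TOKEN_RE.findall(text)

print(tokenize('Break/\\ is almost? ? soon'))
# ['Break', '/', '\\', 'is', 'almost', '?', '?', 'soon']
```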

[\\!"#$%&\'()*+,-./:;<=>?@[\]^_`{|}~]|\w+



2 Comments

Follow-up: Is there a way to treat the input text you're matching against as a raw string literal?
@user3750474: I don't understand the follow-up. If your problem is that the input string is a literal in your Python source code, so its backslashes are being processed as escapes, then yes, just stick an r before the opening quote and the problem will go away. If the input string came from somewhere else, it's not a literal; you get the bytes the way they were stored in the file or sent over the socket or calculated by the function or whatever, and there's no way for your code to retroactively change what the editor/server/whatever did. If you have a specific example, you might want to create a new question.
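
A small sketch of that distinction, with io.StringIO standing in for a file or socket (an assumption made only for the demo): escape processing applies to literals in source code, not to text that arrives at run time.

```python
import io

# Two literals for the same text: Break/\ is almost? ? soon
escaped = 'Break/\\ is almost? ? soon'   # regular literal, backslash doubled
raw = r'Break/\ is almost? ? soon'       # raw literal, backslash written once

# Text read at run time is never escape-processed; it arrives as stored.
from_stream = io.StringIO(raw).read()

print(escaped == raw == from_stream)     # True
print(from_stream.count('\\'))           # 1
```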

Off topic, but this also works (assuming re is imported):

list(filter(str.strip, re.split(r'(\W)', 'Break/\\ is almost? ? soon')))
