
I need to search a string for multiple words.

import re

words = [{'word':'test1', 'case':False}, {'word':'test2', 'case':False}]

status = "test1 test2"

for w in words:
    if w['case']:
        r = re.compile("\s#?%s" % w['word'], re.IGNORECASE|re.MULTILINE)
    else:
        r = re.compile("\s#?%s" % w['word'], re.MULTILINE)
    if r.search(status):
        print "Found word %s" % w['word']

For some reason, this will only ever find "test2" and never "test1". Why is this?

I know I can use |-delimited alternation, but there could be hundreds of words, which is why I am using a for loop.

2 Answers

9

There is no space before test1 in status, while your generated regular expressions require there to be a space.

You can modify the test to match either after a space or at the beginning of a line:

for w in words:
    if w['case']:
        r = re.compile("(^|\s)#?%s" % w['word'], re.IGNORECASE|re.MULTILINE)
    else:
        r = re.compile("(^|\s)#?%s" % w['word'], re.MULTILINE)
    if r.search(status):
        print "Found word %s" % w['word']
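The two branches differ only in their flags, so the pattern can be built once. The following is a minimal sketch under that assumption, using the same `words` and `status` as the question; the `find_words` helper name and the use of `re.escape` are illustrative additions, not part of the original code:

```python
import re

words = [{'word': 'test1', 'case': False}, {'word': 'test2', 'case': False}]
status = "test1 test2"

def find_words(words, text):
    """Return the words found at the start of a line or after whitespace."""
    found = []
    for w in words:
        flags = re.MULTILINE
        if w['case']:
            # Mirrors the question's code: a true 'case' adds IGNORECASE.
            flags |= re.IGNORECASE
        # The raw string keeps \s intact; re.escape guards against
        # regex metacharacters inside the word itself.
        pattern = r"(^|\s)#?%s" % re.escape(w['word'])
        if re.search(pattern, text, flags):
            found.append(w['word'])
    return found

print(find_words(words, status))
```

With the sample `status`, both test1 and test2 are reported, because `^` lets the first word match without a preceding space.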

3 Comments

It's recommended to use raw strings when the string contains backslashes.
Thanks, not sure how I missed that.
@MRAB I focused on the immediate problem; there are a number of other improvements possible (such as a flags variable to remove the redundant re.compile line).
2

As Martijn pointed out, there's no space before test1. Your code also doesn't handle the case where a word is only the prefix of a longer one: it would report test2blabla as an instance of test2, and I'm not sure that is what you want.

I suggest using the word-boundary anchor \b:

for w in words:
    if w['case']:
        r = re.compile(r"\b%s\b" % w['word'], re.IGNORECASE|re.MULTILINE)
    else:
        r = re.compile(r"\b%s\b" % w['word'], re.MULTILINE)
    if r.search(status):
        print "Found word %s" % w['word']

EDIT:

I should've pointed out that if you really want to allow only (whitespace)word or (whitespace)#word format, you cannot use \b.
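One possible way to keep the optional # and still reject longer words such as test2blabla is to swap \b for lookarounds. This is only a sketch of that idea; the `make_pattern` helper and the sample `status` string are hypothetical, not taken from the question:

```python
import re

def make_pattern(word, ignore_case=False):
    """Match word or #word preceded by whitespace (or the start of the
    string) and not followed by another word character."""
    flags = re.IGNORECASE if ignore_case else 0
    # (?<!\S) : the previous character, if any, must be whitespace
    # (?!\w)  : the next character must not be a letter, digit or underscore
    return re.compile(r"(?<!\S)#?%s(?!\w)" % re.escape(word), flags)

status = "test1 #test2 test3blabla x#test4"
for word in ["test1", "test2", "test3", "test4"]:
    print(word, bool(make_pattern(word).search(status)))
```

Here test1 and #test2 match, while test3blabla (longer word) and x#test4 (no whitespace before the #) do not.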

3 Comments

You are missing the #? from the original test.
Fair enough. But since that didn't make a difference for the test string, I dropped it. Of course, it interferes with the word boundary.
Isn't that a rather important detail then?
