
I have a string. I need to know if any of the following substrings appear in the string.

So, if I have:

thing_name = "VISA ASSESSMENTS"

I've been doing my searches with:

any(_ in thing_name for _ in ['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL'])

I'm going through a long list of thing_name items, and I don't need to filter, exactly, just check for any number of substrings.

Is this the best way to do this? It feels wrong, but I can't think of a more efficient way to pull this off.

  • Are you searching for exact words VISA and ASSESSMENTS or any substring? Commented Aug 19, 2013 at 18:22
  • You could probably speed things up a bit if your list of possible substrings were a set, but other than that I don't think there's anything wrong with this approach. This is fine and readable. The next step would be serious algorithm design. Commented Aug 19, 2013 at 18:39
  • @AshwiniChaudhary, I am looking for either 'ASSESSMENTS', 'KILOBYTES' or one of several other items. Commented Aug 19, 2013 at 18:44
  • @kojiro, Thanks. I'll run some speed tests. Commented Aug 19, 2013 at 18:45
  • 1
    @JoshEnglish I don't know your initial data, but it could be worth eliminating substrings within the set: for example, if you have {'foo', 'bar', 'foobaz'} you can remove foobaz from the set. Commented Aug 19, 2013 at 18:47
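The pruning idea in that last comment can be sketched as follows (`prune_supersets` is a hypothetical helper name; the approach assumes that if a shorter substring is contained in a longer one, the longer one can never match without the shorter one matching too):

```python
def prune_supersets(subs):
    """Drop any substring that contains another substring from the
    collection, since the shorter one always matches first."""
    kept = []
    for s in sorted(subs, key=len):  # check shorter candidates first
        if not any(k in s for k in kept):
            kept.append(s)
    return kept

# keeps 'foo' and 'bar', drops 'foobaz' (which contains 'foo')
print(prune_supersets({'foo', 'bar', 'foobaz'}))
```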

2 Answers


You can try re.search to see if that is faster. Something along the lines of

import re
pattern = re.compile('|'.join(['ASSESSMENTS','KILOBYTE','INTERNATIONAL']))
isMatch = pattern.search(thing_name) is not None
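A runnable variant of the same idea, with `re.escape` added so that any regex metacharacters in the substrings are treated literally (the substring list here is just the question's example data):

```python
import re

substrings = ['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL']
# Escape each alternative so characters like '.', '+', or '(' in a
# substring are matched literally rather than as regex syntax.
pattern = re.compile('|'.join(re.escape(s) for s in substrings))

print(bool(pattern.search("VISA ASSESSMENTS")))  # True
print(bool(pattern.search("MASTERCARD")))        # False
```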

3 Comments

Thanks for the suggestions, everyone. As a result of this, I did some timing and found the re solutions to be the fastest in my situation. I came up with eight different ways of doing this and, given the length and number of fields in my data, the regular expression approach was fastest. Pre-compiling the regular expression was best; second best was compiling it on each pass through my list. Set intersection was third fastest, and the surprise was that checking the length of the intersection against 0 was faster than simply converting the intersection to a boolean.
I'd suggest using re.escape on the substrings before passing them to re.compile (i.e. re.compile('|'.join(re.escape(sub) for sub in [...]))); otherwise a substring that contains non-alphanumeric characters might create problems (it can even cause re.compile to fail).
@Bakuriu, +1 for telling about re.escape.

If your list of substrings is small and the input is small, then using a for loop to do the comparisons is fine.

Otherwise, the fastest way I know to search a string for a (large) list of substrings is to construct a DAWG (directed acyclic word graph) from the word list and then iterate through the input string, keeping a list of active DAWG traversals and recording a match each time a traversal completes a word.
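A minimal sketch of that traversal idea, using a plain trie of nested dicts in place of a full DAWG (a DAWG additionally merges shared suffixes to save memory, but the scan over the input is the same; all names here are illustrative):

```python
def build_trie(words):
    """Build a trie as nested dicts; '$' marks a completed word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w  # store the matched substring at the terminal node
    return root

def find_substrings(text, root):
    """Scan text once, keeping a list of active trie traversals."""
    found = set()
    active = []  # partial traversals still alive at this position
    for ch in text:
        active.append(root)  # a new match may start at every position
        next_active = []
        for node in active:
            nxt = node.get(ch)
            if nxt is not None:
                if '$' in nxt:
                    found.add(nxt['$'])  # a word just completed here
                next_active.append(nxt)
        active = next_active
    return found

trie = build_trie(['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL'])
print(find_substrings("VISA ASSESSMENTS", trie))  # {'ASSESSMENTS'}
```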

Another way is to add all the substrings to a hashtable and then hash every possible substring (up to the length of the longest substring) as you traverse the input string.
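The windowed-lookup idea above can be sketched like this (`contains_any` is a hypothetical name; the set membership test does the hashing of each slice for us):

```python
def contains_any(text, subs):
    """Check every slice of text, up to the longest substring's length,
    against a hash set of the target substrings."""
    table = set(subs)
    max_len = max(map(len, table))
    for i in range(len(text)):
        # slices starting at i, from length 1 up to max_len
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            if text[i:j] in table:
                return True
    return False

print(contains_any("VISA ASSESSMENTS", ['ASSESSMENTS', 'KILOBYTE']))  # True
```

This is O(len(text) * max_len) slice lookups, so it only pays off when the substring list is large relative to the window size.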

It's been a while since I've worked in python, my memory of it is that it's slow to implement stuff in. To go the DAWG route, I would probably implement it as a native module and then use it from python (if possible). Otherwise, I'd do some speed checks to verify first but probably go the hashtable route since there are already high performance hashtables in python.

1 Comment

A link to a document on directed acyclic word graphs would probably be in order here. en.wikipedia.org/wiki/Directed_acyclic_word_graph
