
I have a string. I need to know if any of the following substrings appear in the string.

So, if I have:

thing_name = "VISA ASSESSMENTS"

I've been doing my searches with:

any(_ in thing_name for _ in ['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL'])

I'm going through a long list of thing_name items, and I don't need to filter, exactly, just check for any number of substrings.

Is this the best way to do this? It feels wrong, but I can't think of a more efficient way to pull this off.

  • Are you searching for exact words VISA and ASSESSMENTS or any substring? Commented Aug 19, 2013 at 18:22
  • You could probably speed things up a bit if your list of possible substrings were a set, but other than that I don't think there's anything wrong with this approach. This is fine and readable. The next step would be serious algorithm design. Commented Aug 19, 2013 at 18:39
  • @AshwiniChaudhary, I am looking for either 'ASSESSMENTS', 'KILOBYTES' or one of several other items. Commented Aug 19, 2013 at 18:44
  • @kojiro, Thanks. I'll run some speed tests. Commented Aug 19, 2013 at 18:45
  • 1
    @JoshEnglish I don't know your initial data, but it could be worth eliminating substrings within the set: for example, if you have {'foo', 'bar', 'foobaz'} you can remove foobaz from the set. Commented Aug 19, 2013 at 18:47
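The pruning idea in that last comment can be sketched as follows (`prune_supersets` is a hypothetical helper name; the approach assumes that if a shorter substring is contained in a longer one, the longer one can never match without the shorter one matching too):

```python
def prune_supersets(subs):
    """Drop any substring that contains another substring from the
    collection, since the shorter one always matches first."""
    kept = []
    for s in sorted(subs, key=len):  # check shorter candidates first
        if not any(k in s for k in kept):
            kept.append(s)
    return kept

# keeps 'foo' and 'bar', drops 'foobaz' (which contains 'foo')
print(prune_supersets({'foo', 'bar', 'foobaz'}))
```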

2 Answers


You can try re.search to see if that is faster. Something along the lines of

import re
pattern = re.compile('|'.join(['ASSESSMENTS','KILOBYTE','INTERNATIONAL']))
isMatch = pattern.search(thing_name) is not None
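A runnable variant of the same idea, with `re.escape` added so that any regex metacharacters in the substrings are treated literally (the substring list here is just the question's example data):

```python
import re

substrings = ['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL']
# Escape each alternative so characters like '.', '+', or '(' in a
# substring are matched literally rather than as regex syntax.
pattern = re.compile('|'.join(re.escape(s) for s in substrings))

print(bool(pattern.search("VISA ASSESSMENTS")))  # True
print(bool(pattern.search("MASTERCARD")))        # False
```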

3 Comments

Thanks for the suggestions, everyone. As a result of this, I did some timing and found the re solutions to be the fastest in my situation. I came up with eight different ways of doing this and, given the length and number of fields in my data, the regular expression approach was fastest. Pre-compiling the regular expression was best; second best was compiling it on each pass through my list. Set intersection was third fastest, and the surprise was that checking the length of the intersection against 0 was faster than simply converting the intersection to a boolean.
I'd suggest using re.escape on the substrings before passing them to re.compile (i.e. re.compile('|'.join(re.escape(sub) for sub in [...]))); otherwise a substring that contains non-alphanumeric characters might create problems (it can even cause re.compile to fail).
@Bakuriu, +1 for telling about re.escape.

If your list of substrings is small and the input is small, then using a for loop to do the comparisons is fine.

Otherwise, the fastest way I know to search a string for a (large) list of substrings is to construct a DAWG (directed acyclic word graph) from the word list and then iterate through the input string, keeping a list of active DAWG traversals and recording a match each time a traversal completes a word.
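A minimal sketch of that traversal idea, using a plain trie of nested dicts in place of a full DAWG (a DAWG additionally merges shared suffixes to save memory, but the scan over the input is the same; all names here are illustrative):

```python
def build_trie(words):
    """Build a trie as nested dicts; '$' marks a completed word."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = w  # store the matched substring at the terminal node
    return root

def find_substrings(text, root):
    """Scan text once, keeping a list of active trie traversals."""
    found = set()
    active = []  # partial traversals still alive at this position
    for ch in text:
        active.append(root)  # a new match may start at every position
        next_active = []
        for node in active:
            nxt = node.get(ch)
            if nxt is not None:
                if '$' in nxt:
                    found.add(nxt['$'])  # a word just completed here
                next_active.append(nxt)
        active = next_active
    return found

trie = build_trie(['ASSESSMENTS', 'KILOBYTE', 'INTERNATIONAL'])
print(find_substrings("VISA ASSESSMENTS", trie))  # {'ASSESSMENTS'}
```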

Another way is to add all the substrings to a hashtable and then hash every possible substring (up to the length of the longest substring) as you traverse the input string.
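The windowed-lookup idea above can be sketched like this (`contains_any` is a hypothetical name; the set membership test does the hashing of each slice for us):

```python
def contains_any(text, subs):
    """Check every slice of text, up to the longest substring's length,
    against a hash set of the target substrings."""
    table = set(subs)
    max_len = max(map(len, table))
    for i in range(len(text)):
        # slices starting at i, from length 1 up to max_len
        for j in range(i + 1, min(i + max_len, len(text)) + 1):
            if text[i:j] in table:
                return True
    return False

print(contains_any("VISA ASSESSMENTS", ['ASSESSMENTS', 'KILOBYTE']))  # True
```

This is O(len(text) * max_len) slice lookups, so it only pays off when the substring list is large relative to the window size.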

It's been a while since I've worked in python, my memory of it is that it's slow to implement stuff in. To go the DAWG route, I would probably implement it as a native module and then use it from python (if possible). Otherwise, I'd do some speed checks to verify first but probably go the hashtable route since there are already high performance hashtables in python.

1 Comment

A link to a document on directed acyclic word graphs would probably be in order here. en.wikipedia.org/wiki/Directed_acyclic_word_graph
