Fast way to find strings in set of strings containing substring

Question

Task

I have a set S of n = 10,000,000 strings s and need to find the set S_p containing the strings s of S that contain the substring p.

Simple solution

As I'm using C# this is quite a simple task using LINQ:

string[] S = new string[] { "Hello", "world" };
string p = "ll";
IEnumerable<string> S_p = S.Where(s => s.Contains(p));

Problem

If S contains many strings (like the mentioned 10,000,000 strings) this gets horribly slow.

Idea

Build some kind of index to retrieve S_p faster.

Question

What is the best way to index S for this task and do you have any implementation in C#?

Is your set S constant? How many different p are used with the same S? — MBo
– MBo, Commented Oct 10, 2014 at 15:04
Yes, S is constant. I will use thousands of different p for S. It's a search-engine but the content won't change. — user2033412
– user2033412, Commented Oct 10, 2014 at 15:06
No I haven't yet. But from the first look it doesn't seem to be what I want. I'd like to have a lightweight solution with as less external dependencies as possible. The best would be a single, small C#-class. — user2033412
– user2033412, Commented Oct 10, 2014 at 15:34

kraskevich · Accepted Answer · 2014-10-10 14:43:35Z

1

Here is one way to do it:
1. Create a string T = S[0] + sep_0 + S[1] + sep_1 + ... + S[n - 1] + sep_n-1(where sep_i is a unique character that never appears in S[j] for any j(it can actually be an integer number if the set of characters is not big enough)).
2. Build a suffix tree for T(it can be done in linear time).
3. For each query string Q traverse the suffix tree(it takes O(length(Q)) time). Then all possible answers will be located in the leaves of some subtree. So you can just traverse all these leaves. If Q is rather long, then the number of leaves in this subtree is likely to be much smaller than n.
4. If Q is really short, then the number of leaves in a subtree can be pretty large. That's why you can use another strategy for short query strings: precompute all short substrings of S[0] ... S[n - 1] and for each of them store a set of indices where it has occurred. Then you can just print these indices for a given Q. It is difficult to say what 'short' exactly means here, but it can be found out experimentally.

answered Oct 10, 2014 at 14:43

kraskevich

18.6k4 gold badges37 silver badges46 bronze badges

Sign up to request clarification or add additional context in comments.

11 Comments

user2033412 Over a year ago

I read a bit about suffix trees and it seems that nowadays it's recommended to use suffix arrays instead. Is there anything I oversaw or do you agree?

kraskevich Over a year ago

@user2033412 Finding the range of occurrenses for a query string can be done in O(length(Q) * log n) in a suffix array(not linear as in a suffix tree)unless you use hashing to compare strings, but in general suffix array is a good choice here too.

user2033412 Over a year ago

Can sep_i = sep_j for all i, j or do I need different ones?

kraskevich Over a year ago

@user2033412 For a suffix tree they definitely should be different, for a suffix array using the same separator for all i looks fine.

kraskevich Over a year ago

@user2033412 You still need it. For example, if you concatenate aa and bb without a separator, you will get aabb and eventually find ab there, even though it is not a substring of any of them.

|

Collectives™ on Stack Overflow

Fast way to find strings in set of strings containing substring

Task

Simple solution

Problem

Idea

Question

1 Answer 1

11 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

Task

Simple solution

Problem

Idea

Question

1 Answer 1

11 Comments

Your Answer

Sign up or log in

Post as a guest

Related