2

Task

I have a set S of n = 10,000,000 strings s and need to find the set Sp containing the strings s of S that contain the substring p.

Simple solution

As I'm using C# this is quite a simple task using LINQ:

string[] S = new string[] { "Hello", "world" };
string p = "ll";
IEnumerable<string> S_p = S.Where(s => s.Contains(p));

Problem

If S contains many strings (like the mentioned 10,000,000 strings) this gets horribly slow.

Idea

Build some kind of index to retrieve Sp faster.

Question

What is the best way to index S for this task and do you have any implementation in C#?

4
  • Is your set S constant? How many different p are used with the same S? Commented Oct 10, 2014 at 15:04
  • Yes, S is constant. I will use thousands of different p for S. It's a search-engine but the content won't change. Commented Oct 10, 2014 at 15:06
  • Have you considered lucene.net? Commented Oct 10, 2014 at 15:19
  • No I haven't yet. But from the first look it doesn't seem to be what I want. I'd like to have a lightweight solution with as less external dependencies as possible. The best would be a single, small C#-class. Commented Oct 10, 2014 at 15:34

1 Answer 1

1

Here is one way to do it:
1. Create a string T = S[0] + sep_0 + S[1] + sep_1 + ... + S[n - 1] + sep_n-1(where sep_i is a unique character that never appears in S[j] for any j(it can actually be an integer number if the set of characters is not big enough)).
2. Build a suffix tree for T(it can be done in linear time).
3. For each query string Q traverse the suffix tree(it takes O(length(Q)) time). Then all possible answers will be located in the leaves of some subtree. So you can just traverse all these leaves. If Q is rather long, then the number of leaves in this subtree is likely to be much smaller than n.
4. If Q is really short, then the number of leaves in a subtree can be pretty large. That's why you can use another strategy for short query strings: precompute all short substrings of S[0] ... S[n - 1] and for each of them store a set of indices where it has occurred. Then you can just print these indices for a given Q. It is difficult to say what 'short' exactly means here, but it can be found out experimentally.

Sign up to request clarification or add additional context in comments.

11 Comments

I read a bit about suffix trees and it seems that nowadays it's recommended to use suffix arrays instead. Is there anything I oversaw or do you agree?
@user2033412 Finding the range of occurrenses for a query string can be done in O(length(Q) * log n) in a suffix array(not linear as in a suffix tree)unless you use hashing to compare strings, but in general suffix array is a good choice here too.
Can sep_i = sep_j for all i, j or do I need different ones?
@user2033412 For a suffix tree they definitely should be different, for a suffix array using the same separator for all i looks fine.
@user2033412 You still need it. For example, if you concatenate aa and bb without a separator, you will get aabb and eventually find ab there, even though it is not a substring of any of them.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.