Most efficient way to remove substrings from an array of strings

Question

Note: This question isn't specifically related to Python. I just used it as a substitute for pseudo-code here.

Given an array A containing N strings with an average length M, I want to create a new array B that only contains those strings in A that are not a substring (or and identical copy) of any other string in A. Here is an example:

A = [ 'foo', 'bar', 'foobar', 'foobar' ]
B = [ 'foobar' ]

I'm specifically looking for the most efficient way of doing this in terms of time complexity. The naive approach would look like this

B = []
for i in range(0, len(A)):
    noSubstring = True
    for j in range(i + 1, len(A)):
        if A[i] in A[j]:
            noSubstring = False
            break
    if noSubstring:
        B.append(A[i])

and has a time complexity of O(N^2 * M^2). Is there anything I could do to speed this up?

I've been thinking about using a dedicated data structure for efficiently encoding and reusing the string sequences. For instance, if I wanted to remove strings that are only a prefix of another string in the array, I could create a trie / prefix tree (O(N*M)) and then collect all the leaf elements (another O(N*M)). So far I failed, to adapt this approach to the more general substring problem, however.

You might want to use a set for A, since you don't want duplicates anyway. — Cyphase
– Cyphase, Commented Aug 18, 2015 at 11:26
Here's this question, with an answer, on Computer Science: Given n strings, is one of them a substring of another? — Cyphase
– Cyphase, Commented Aug 18, 2015 at 11:27
In your naiva implementation you could avoid doing the string search if A[i] is longer than A[j] — DisappointedByUnaccountableMod
– DisappointedByUnaccountableMod, Commented Aug 18, 2015 at 11:28

amit · Accepted Answer · 2015-08-18 11:39:28Z

5

First eliminate all duplicates. This is fairly easy to do by using a hash-table while iterating the data, and storing already seen strings. (If you are afraid about worst case behavior of hash tables, you could use a trie or sort and iterate to filter out dupes)

Once you have filtered out all duplicates, create a suffix-tree for all the remaining strings.
After creating the suffix tree, for each string, check if it exists as a suffix of some string which is not itself. This is done by following the path on the suffix tree from the root until the end of the string, and if your only option is the exact same string, it's not a substring (otherwise - it is).

Time complexity:

Filtering out dupes: O(n*m)
Building suffix tree O(n*m) in theory, but practically it is done in O(n*mlog(m)).
Checking each string is then O(m), repeated for n strings is O(nm)

Total complexity is O(n*mlog(m))

edited Aug 18, 2015 at 11:39

answered Aug 18, 2015 at 11:29

amit

179k27 gold badges245 silver badges348 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

thefourtheye Over a year ago

Filtering out dupes - Hashtable approach will be theoretically O(n), right?

amit Over a year ago

@thefourtheye O(nm), you need to read all the strings, and each is taking O(m) time. It can also be done by sorting, or even by creating a trie out of the strings if you are afraid of worst case behavior of hashing.

Juan Lopes Over a year ago

I think a suffix array would be more memory efficient in this case. It would increase the check phase from O(nm) to O(nm log m), but since the construction is itself that complexity already, it wouldn't be a problem.

Collectives™ on Stack Overflow

Most efficient way to remove substrings from an array of strings

1 Answer 1

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related