4

Note: This question isn't specifically related to Python. I just used it as a substitute for pseudo-code here.

Given an array A containing N strings with an average length M, I want to create a new array B that only contains those strings in A that are not a substring (or and identical copy) of any other string in A. Here is an example:

A = [ 'foo', 'bar', 'foobar', 'foobar' ]
B = [ 'foobar' ]

I'm specifically looking for the most efficient way of doing this in terms of time complexity. The naive approach would look like this

B = []
for i in range(0, len(A)):
    noSubstring = True
    for j in range(i + 1, len(A)):
        if A[i] in A[j]:
            noSubstring = False
            break
    if noSubstring:
        B.append(A[i])

and has a time complexity of O(N^2 * M^2). Is there anything I could do to speed this up?

I've been thinking about using a dedicated data structure for efficiently encoding and reusing the string sequences. For instance, if I wanted to remove strings that are only a prefix of another string in the array, I could create a trie / prefix tree (O(N*M)) and then collect all the leaf elements (another O(N*M)). So far I failed, to adapt this approach to the more general substring problem, however.

3
  • You might want to use a set for A, since you don't want duplicates anyway. Commented Aug 18, 2015 at 11:26
  • 1
    Here's this question, with an answer, on Computer Science: Given n strings, is one of them a substring of another? Commented Aug 18, 2015 at 11:27
  • In your naiva implementation you could avoid doing the string search if A[i] is longer than A[j] Commented Aug 18, 2015 at 11:28

1 Answer 1

5

First eliminate all duplicates. This is fairly easy to do by using a hash-table while iterating the data, and storing already seen strings. (If you are afraid about worst case behavior of hash tables, you could use a trie or sort and iterate to filter out dupes)

Once you have filtered out all duplicates, create a suffix-tree for all the remaining strings.
After creating the suffix tree, for each string, check if it exists as a suffix of some string which is not itself. This is done by following the path on the suffix tree from the root until the end of the string, and if your only option is the exact same string, it's not a substring (otherwise - it is).

Time complexity:

  • Filtering out dupes: O(n*m)
  • Building suffix tree O(n*m) in theory, but practically it is done in O(n*mlog(m)).
  • Checking each string is then O(m), repeated for n strings is O(nm)

Total complexity is O(n*mlog(m))

Sign up to request clarification or add additional context in comments.

3 Comments

Filtering out dupes - Hashtable approach will be theoretically O(n), right?
@thefourtheye O(nm), you need to read all the strings, and each is taking O(m) time. It can also be done by sorting, or even by creating a trie out of the strings if you are afraid of worst case behavior of hashing.
I think a suffix array would be more memory efficient in this case. It would increase the check phase from O(nm) to O(nm log m), but since the construction is itself that complexity already, it wouldn't be a problem.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.