1

I am just checking for an efficient algorithm with best computational complexity to check if a child string - tobeVerified exists in a huge parent string

I was going through different algorithms but I am yet to find something which offers O(n)

I came up with the below implementation using HashSet which is gives me O(n+m) ~ O(n)

I wanted to check if this is the right way of doing it or if any other optimization is possible. But in this approach there is problem of consuming more space

String parent = "the value is very high";
    String tobeVerified = "is";
    Set wordSet = new HashSet<String>();    
    String[] words = parent.trim().toUpperCase().split("\\s+");
    //This is O(n) n - Parent Size  m - substring size
    for(String word: words){
        wordSet.add(word);      
    }
    //This is O(1)
    System.out.println(wordSet.contains(tobeVerified.toUpperCase()));
    }
14
  • 1
    What's wrong with String.contains()? My guess is that it's way more efficient than your algorithm. Commented Jan 14, 2017 at 7:53
  • @JBNizet .. String.contains() has a complexity of O(m*n). It again calls indexOf() which has this complexity .. please check here - stackoverflow.com/questions/12752274/… Commented Jan 14, 2017 at 7:55
  • 1
    So what? What makes you think it's slower than trimming the whole string (already O(n) if a trim is needed)), converting to uppercase (O(n)), splitting with a regex (O(n) with a big number of memory allocation, copies of arrays, creation of many String objects), filling a hashset (O(n), creation of many entry objects, memory allocation, hashCode computation), lookup (O(1)), garbage collection. I mean just comparing each word you put in the set to the searched string, and not building the set at all would already be faster. Commented Jan 14, 2017 at 8:04
  • 1
    Do you use the same huge string many times with different patterns? Commented Jan 14, 2017 at 8:07
  • 1
    @AdityaReddy So you can try to build appropriate data structure once - suffix array, suffix tree or trie, and search patterns in it with high speed Commented Jan 14, 2017 at 8:28

2 Answers 2

3

One of the classic O(n+m) substring search algorithms is Boyer-Moore. It should have better performance than String.contains or String.indexOf for sufficiently large strings.

There's a Java implementation of the algorithm on that wikipedia page link above, but it's written to use a char[] array as input instead of on an instance of the String class. So either modify the code to work with a String parameter , or take into account the additional cost, O(n), of cloning a String into a char[].

One little issue I spotted on the wikipedia code. It assumes character values are only in the 8-bit range. You might need to modify this line:

final int ALPHABET_SIZE = 256;

To be this:

final int ALPHABET_SIZE = 65536;

Update: I updated the wikipedia page code appropriately to have the correct value for ALPHABET_SIZE. Confirmed the original bug existed and wrote a unit test to validate the fix.

Sign up to request clarification or add additional context in comments.

2 Comments

Thanks .. this looks promising .. will run some tests with this
I fixed the code on the wikipedia page with the change described above. I wrote a unit test to confirm different string patterns.
1

You could go for a Boyer-Moore implementation, as suggested in answer by selbie, if profiling shows that you truly have a performance issue.

Until then, just do a simple regex search:

String textToSearch = "the value is very high";
String wordToFind = "is";
String regex = "(?i)\\b" + Pattern.quote(wordToFind) + "\\b";
boolean found = Pattern.compile(regex).matcher(textToSearch).find();

(?i) makes the search case-insensitive, and \\b matches word-boundary, e.g. ensuring that is will not match this. Since you're doing word-search, Pattern.quote() is likely unnecessary, but better be safe than sorry.

1 Comment

what about if we want to also get the count of matches?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.