
I have very long strings that I need to compare for equality. Since comparing them character by character is very time consuming, I'd like to create a hash for each string.

I'd like the generated hash code to be unique (or the chance that two strings generate the same hash to be very small). I think an int created from a string is not strong enough as a hash to rule out two different strings having the same hash code, so I am looking for a string hash code.

Am I right in the above assumption?

To clarify, assume that I have a string of, say, 1K length and I create a hash code of 10 characters; then comparing hash codes speeds things up by a factor of 100.

The question I have is how to create such a hash code in C++?

I am developing on Windows using Visual Studio 2012.

Comments

  • Try using the standard library: std::hash<std::string>, en.cppreference.com/w/cpp/utility/hash Commented Aug 20, 2013 at 11:12
  • A similar question: stackoverflow.com/questions/8094790/… Commented Aug 20, 2013 at 11:13
  • Just to be sure -- you want to compare one string with others multiple times, right? Otherwise, you won't get any speedup. Hashing a string is linear in string length for non-trivial hash functions, just as comparison is. Commented Aug 20, 2013 at 11:38
  • @arne It's worse than that. Hashing requires looking at every character in the string; with comparison, you stop at the first characters that aren't equal. Commented Aug 20, 2013 at 12:05
  • @JamesKanze: In practice, you're right, but from a computational-complexity point of view, the worst-case complexity of comparison is still linear in string length, i.e. if the last character differs. Commented Aug 20, 2013 at 12:09

5 Answers


To be useful in this case, the hash code must be quick to calculate. Using anything larger than the largest words supported by the hardware (usually 64 bits) may be counterproductive. Still, you can give it a try. I've found the following to work fairly well:

#include <string>

unsigned long long
hash( std::string const& s )
{
    unsigned long long results = 12345; //  anything but 0 is probably OK.
    for ( auto current = s.begin(); current != s.end(); ++current ) {
        results = 127 * results + static_cast<unsigned char>( *current );
    }
    return results;
}

Using a hash like this will probably not be advantageous, however, unless most of the comparisons are with strings that aren't equal, but have long common initial sequences. Remember that if the hashes are equal, you still have to compare the strings, and that comparison only has to go until the first characters which aren't equal. (In fact, most of the comparison functions I've seen start by comparing length, and only compare characters if the strings are of equal length.)
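Since the hash is only worth computing when it is reused across many comparisons, one option is to compute it once per string and cache it. A sketch, repeating the hash function above for self-containment (the HashedString wrapper is my own, not from the answer):

```cpp
#include <string>
#include <utility>

unsigned long long
hash( std::string const& s )
{
    unsigned long long results = 12345;
    for ( auto current = s.begin(); current != s.end(); ++current ) {
        results = 127 * results + static_cast<unsigned char>( *current );
    }
    return results;
}

// Precompute the hash once per string; repeated comparisons then
// start with a cheap integer check before any character comparison.
struct HashedString {
    std::string text;
    unsigned long long h;
    explicit HashedString( std::string s )
        : text( std::move( s ) ), h( hash( text ) ) {}
};

bool equal( HashedString const& a, HashedString const& b )
{
    return a.h == b.h          // cheap integer reject on mismatch
        && a.text == b.text;   // confirm on a (rare) hash match
}
```

The full string comparison in the last line also makes the function correct in the unlikely event of a hash collision.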




There are a lot of hashing algorithms available which you may use.

If you want to implement one yourself, a simple one could be to take the ASCII value of each character, offset so that it starts at 1 (i.e. a = 1, b = 2, ...), and multiply it by the character's index in the string. Keep adding these values and store the sum as the hash value for a particular string.

For example, the hash value for abc would be:

HASH("abc") = 1*1 + 2*2 + 3*3 = 14; 

The probability of collision lowers as the string length increases (considering your strings will be lengthy).
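The scheme described above might be sketched as follows (the function name is my own; it assumes lowercase ASCII input, as in the "abc" example, and uses 1-based positions so that HASH("abc") = 14):

```cpp
#include <cstddef>
#include <string>

// Each letter's 1-based alphabet value (a = 1, b = 2, ...),
// multiplied by its 1-based position in the string, summed up.
unsigned long long positional_hash( std::string const& s )
{
    unsigned long long sum = 0;
    for ( std::size_t i = 0; i < s.size(); ++i ) {
        sum += static_cast<unsigned long long>( s[i] - 'a' + 1 ) * ( i + 1 );
    }
    return sum;
}
```

As the comments below point out, this is a weak hash; it is shown only to illustrate the description.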

Comments

  • Have you done any measurements to show that this hashing is effective? I rather doubt that it would be (but I've not actually tried it either).
  • No, I didn't try it either. I agree that there will be collisions, but I suppose that even your method will have some.
  • There will always be some collisions; the goal is to minimize them. Your suggestion will lead to a lot of collisions.
  • @JamesKanze I do not think so, as each letter value is being multiplied by its index as well. Also, as the OP said that the string length is very long, there may be very few collisions.
  • What you think isn't really relevant. Multiplying each letter by its index will end up causing some letters to be excluded from the hash, so it is, a priori, a bad hashing algorithm. (For short strings, it's particularly bad, but that isn't a problem here.)

There are many known hash algorithms available, for example MD5, SHA1, etc. You should not need to implement your own algorithm; use one of the available ones. Use the search engine of your choice to find implementations like this one.

Comments

  • Crypto hashes like MD5 are much more expensive to calculate than a simple straightforward hash, and are probably not much better for this particular use.

It really depends what your hard requirements are. If you have hard requirements such as "search may never take more than so and so long", then it's possible that no solution is applicable. If your intent is simply to speed up a large number of searches, then a simple, short hash will do fine.

While it is generally true that hashing a 1000-character string to an integer (a single 32-bit or 64-bit number) can, and eventually will, produce collisions, this is not something to be concerned about.
A 10-character hash will also produce collisions. This is a necessary consequence of the fact that 1000 > 10. For every 10-character hash, there exist 100 1000-character strings[1].

The important question is whether you will actually see collisions, how often you will see them, and whether it matters at all. Whether (or how likely) you see a collision depends not on the length of the strings, but on the number of distinct strings.
If you hash 77,100 strings (longer than 4 characters) using a 32-bit hash, there is about a 50% chance that at least one pair of them collides. At 25,000 strings, the likelihood is only somewhere around 5-6%. At 1,000 strings, the likelihood is approximately 0.1%.
Note that when I say "50% at 77,100 strings", this does not mean that your chance of actually encountering a collision is that high. It's merely the chance of having two strings with the same hash value. Unless that's the case for the majority of strings, the chance of actually hitting one is again a lot lower.

Which means, no more and no less, that for most usage cases it simply doesn't matter. Unless you want to hash hundreds of thousands of strings, stop worrying now and use a 32-bit hash.
Otherwise, unless you want to hash billions of strings, stop worrying here and use a 64-bit hash.

Thing is, you must be prepared to handle collisions in any case because as long as you have 2 strings, the likelihood for a collision is never exactly zero. Even hashing only 2 or 3 1000-character strings into a 500-byte hash could in principle have a collision (very unlikely but possible).
Which means you must do a string comparison if the hash matches in either case, no matter how long (or how good or bad) your hash is.

If collisions don't happen every time, they're entirely irrelevant. If you have a lot of collisions in your table and encounter one, say, on 1 in 10,000 lookups (which is a lot!), it has no practical impact. Yes, you will have to do a useless string comparison once in 10,000 lookups, but the other 9,999 work by comparing a single integer alone. Unless you have a hard realtime requirement, the measurable impact is exactly zero.
Even if you totally screw up and encounter a collision on every 5th search (a pretty disastrous case; that would mean that roughly 800 million string pairs collide, which is only possible at all with at least 1.6 billion strings), this still means that 4 out of 5 searches don't hit a collision, so you still discard 80% of non-matches without doing a comparison.

On the other hand, generating a 10-character hash is cumbersome and slow, and you are likely to create a hash function that has more collisions (because of bad design) than an existing 32- or 64-bit hash.
Cryptographic hash functions are certainly better, but they also run slower than their non-cryptographic counterparts, and the storage required for their 16- or 32-byte hash values is much larger (at virtually no benefit to most people). This is a space/time tradeoff.

Personally, I would just use something like djb2, which can be implemented in 3 lines of C code, works well, and runs very fast. There exist of course many other hash functions that you could use, but I like djb2 for its simplicity.
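For reference, djb2 might be sketched like this (5381 and 33 are the standard djb2 seed and multiplier):

```cpp
#include <string>

// djb2 (Daniel J. Bernstein): seed 5381, multiplier 33.
unsigned long djb2( std::string const& s )
{
    unsigned long h = 5381;
    for ( unsigned char c : s ) {
        h = h * 33 + c;   // often written as ((h << 5) + h) + c
    }
    return h;
}
```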

Funnily, after reading James Kanze's answer, the posted code seems to be a variation of djb2, only with a different seed and a different multiplier (djb2 uses 5381 and 33, respectively).
In the same answer, the remark about comparing string lengths first is a good tip as well. It's noteworthy that you can consider a string's length a form of "hash function" too (albeit a rather weak one, but one that often comes "for free").


[1] However, strings are not some "random binary garbage" as the hash is. They are structured, low-entropy data. Insofar, the comparison does not really hold true.



Well, I would first compare string lengths. If they match, then I'd start comparing using an algorithm that tests character equality at random positions, stopping at the first difference. The random positions would be obtained from a stringLength-sized vector filled with random ints ranging from 0 to stringLength-1. I haven't measured this method, though; it's just an idea. But this would save you the concerns of hash collisions, while reducing comparison times.
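The idea above might look roughly like this (a sketch under my own interpretation: the index vector is a shuffled permutation of all positions, so every character is checked exactly once, just in random order):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// Compare characters in a shuffled order, stopping at the first
// mismatch. Same worst case as a plain left-to-right comparison,
// but a mismatch buried deep in an otherwise-identical prefix may
// be found sooner.
bool random_order_equal( std::string const& a, std::string const& b )
{
    if ( a.size() != b.size() )       // cheapest rejection first
        return false;

    std::vector<std::size_t> order( a.size() );
    std::iota( order.begin(), order.end(), 0 );   // 0, 1, ..., n-1
    std::mt19937 gen( std::random_device{}() );
    std::shuffle( order.begin(), order.end(), gen );

    for ( std::size_t i : order ) {
        if ( a[i] != b[i] )
            return false;
    }
    return true;
}
```

Note that for strings that actually are equal, this still visits every character, just like a plain comparison.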

