10

The JDK documentation for java.lang.String.hashCode() famously says:

The hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the *i*th character of the string, n is the length of the string, and ^ indicates exponentiation.

The standard implementation of this expression is:

int hash = 0;
for (int i = 0; i < length; i++)
{
    hash = 31*hash + value[i];
}
return hash;

Looking at this makes me feel like I was sleeping through my algorithms course. How does that mathematical expression translate into the code above?

5 Answers 5

26

unroll the loop. Then you get:

int hash = 0;

hash = 31*hash + value[0];
hash = 31*hash + value[1];
hash = 31*hash + value[2];
hash = 31*hash + value[3];
...
return hash;

Now you can do some mathematical manipulation, plug in 0 for the initial hash value:

hash = 31*(31*(31*(31*0 + value[0]) + value[1]) + value[2]) + value[3])...

Simplify it some more:

hash = 31^3*value[0] + 31^2*value[1] + 31^1*value[2] + 31^0*value[3]...

And that is essentially the original algorithm given.

Sign up to request clarification or add additional context in comments.

2 Comments

You may want to explain it in terms of static single assignment (SSA) form, which then removes the need to think about what value "hash" has at any given point in time. :-)
Looks like the original algorithm says it should be: 31^3*value[0] + 31^2*value[1] + 31^1*value[2] + ... Or is it just my fried brain misfiring?
13

I'm not sure if you missed where it says "^ indicates exponentiation" (not xor) in that documentation.

Each time through the loop, the previous value of hash is multipled by 31 again before being added to the next element of value.

One could prove these things are equal by induction, but I think an example might be more clear:

Say we're dealing with a 4-char string. Let's unroll the loop:

hash = 0;
hash = 31 * hash + value[0];
hash = 31 * hash + value[1];
hash = 31 * hash + value[2];
hash = 31 * hash + value[3];

Now combine these into one statement by substituting each value of hash into the following statement:

hash = 31 * (31 * (31 * (31 * 0 + value[0]) + value[1]) + value[2])
     + value[3];

31 * 0 is 0, so simplify:

hash = 31 * (31 * (31 * value[0] + value[1]) + value[2])
     + value[3];

Now multiply the two inner terms by that second 31:

hash = 31 * (31 * 31 * value[0] + 31 * value[1] + value[2])
     + value[3];

Now multiply the three inner terms by that first 31:

hash = 31 * 31 * 31 * value[0] + 31 * 31 * value[1] + 31 * value[2]
     + value[3];

and convert to exponents (not really Java anymore):

hash = 31^3 * value[0] + 31^2 * value[1] + 31^1 * value[2] + value[3];

4 Comments

RE your first sentence: Did you see some evidence that the question or a particular answer was assuming xor?
You'd expressed confusion about how the code and the documentation could be equivalent. Since the documentation was using "^" for exponentiation, but Java normally uses it to mean bitwise xor I wondered if that was the source of your confusion. (There were no other answers when I started writing my answer, BTW)
Ahh, I see. No, I was aware that it was exponentiation, but unclear on how the implementation followed from the mathematical expression. Your answer clarifies that greatly--but knowing to write that code given only that expression is still a leap for me. To arrive at that code, it would seem that you'd have to write out a small example, realize that you can "multiply by 0 in a clever way" in the innermost nesting to complete the pattern, then form the loop.
It would not surprise me at all if the code actually came first, and the documentation was written afterwards.
10

Proof by induction:

T1(s) = 0 if |s| == 0, else s[|s|-1] + 31*T(s[0..|s|-1])
T2(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
P(n) = for all strings s s.t. |s| = n, T1(s) = T2(s)

Let s be an arbitrary string, and n=|s|
Base case: n = 0
    0 (additive identity, T2(s)) = 0 (T1(s))
    P(0)
Suppose n > 0
    T1(s) = s[n-1] + 31*T1(s[0:n-1])
    T2(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] = s[n-1] + 31*(s[0]*31^(n-2) + s[1]*31^(n-3) + ... + s[n-2]) = s[n-1] + 31*T2(s[0:n-1])
    By the induction hypothesis, (P(n-1)), T1(s[0:n-1]) = T2(s[0:n-1]) so
        s[n-1] + 31*T1(s[0..n-1]) = s[n-1] + T2(s[0:n-1])
    P(n)

I think I have it, and a proof was requested.

Comments

9

Take a look at the first few iterations and you'll see the pattern start to emerge:

hash0 = 0 + s0 = s0
hash1 = 31(hash0) + s1 = 31(s0) + s1
hash2 = 31(hash1) + s2 = 31(31(s0) + s1) + s2 = 312(s0) + 31(s1) + s2
...

3 Comments

<3 Thanks for (more or less) writing out CookieOfFortune's answer in SSA form. Much appreciated!
Would be even better if you could vertically align all the corresponding terms, and distribute the 31(...) in the third line.
@CookieOfFortune: There's an HTML tag for it. Look at the page source. I'd have used Unicode, though.
0

Isn't it useless at all to count the hashcode of the String out of all characters? Imagine filenames or classnames with their full path put into HashSet. Or someone who uses HashSets of String documents instead of Lists because "HashSet always beats Lists".

I would do something like:

int off = offset;
char val[] = value;
int len = count;

int step = len <= 10 ? 1 : len / 10;

for (int i = 0; i < len; i+=step) {
   h = 31*h + val[off+i];
}
hash = h

At the end hashcode is nothing more than a hint.

3 Comments

Ignoring half the characters in the string would mean that storing a sequence of "counting strings" into a hash table could easily cause 100 strings to map to each hash value. Ignoring more than half the characters would make things even worse. Ignoring any aspect of the string for hashing purposes risks a really huge penalty in exchange for a pretty small payoff.
That's essentially what the early designers of java though. Initially the string hash function took only a sample of characters when the string was longer than 15 characters. Eventually it had to be fixed because it turned out to yield very bad hash performance with certain strings (e.g. with set of URLs which often look similar): bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. The performance gains for not using the whole string can not offset the much worse hash performance.
To clarify: the second type of performance if referring to the "hash table" performance, not to the raw speed of computing the hash.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.