Proof: why does java.lang.String.hashCode()'s implementation match its documentation?

Question

The JDK documentation for java.lang.String.hashCode() famously says:

The hash code for a String object is computed as
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
using int arithmetic, where s[i] is the *i*th character of the string, n is the length of the string, and ^ indicates exponentiation.

The standard implementation of this expression is:

int hash = 0;
for (int i = 0; i < length; i++)
{
    hash = 31*hash + value[i];
}
return hash;

Looking at this makes me feel like I was sleeping through my algorithms course. How does that mathematical expression translate into the code above?

CookieOfFortune · Accepted Answer · 2009-05-04 23:16:32Z

26

Unroll the loop. Then you get:

int hash = 0;

hash = 31*hash + value[0];
hash = 31*hash + value[1];
hash = 31*hash + value[2];
hash = 31*hash + value[3];
...
return hash;

Now you can do some mathematical manipulation. Substitute each line into the next and plug in 0 for the initial hash value:

hash = 31*(31*(31*(31*0 + value[0]) + value[1]) + value[2]) + value[3]...

Simplify it some more:

hash = 31^3*value[0] + 31^2*value[1] + 31^1*value[2] + 31^0*value[3]...

And that is essentially the original algorithm given.
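
The equivalence can also be checked mechanically. The following is a minimal sketch, not the JDK source: the class and method names (HashFormulaDemo, loopHash, formulaHash) are made up for illustration. It evaluates the documented sum term by term and compares it with the loop. Because Java int arithmetic wraps around on overflow, the two forms agree even when the intermediate values overflow.

```java
class HashFormulaDemo {
    // Loop form from the question (hypothetical helper name).
    static int loopHash(String s) {
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
        }
        return hash;
    }

    // Documented form: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], in int arithmetic.
    static int formulaHash(String s) {
        int n = s.length();
        int hash = 0;
        for (int i = 0; i < n; i++) {
            int power = 1;
            for (int k = 0; k < n - 1 - i; k++) {
                power *= 31; // computes 31^(n-1-i), wrapping on overflow
            }
            hash += s.charAt(i) * power;
        }
        return hash;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "a", "abcd", "hello, world"}) {
            System.out.println("\"" + s + "\": loop=" + loopHash(s)
                    + " formula=" + formulaHash(s)
                    + " String.hashCode()=" + s.hashCode());
        }
    }
}
```

For every input, the three printed values should be identical.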

edited May 4, 2009 at 23:16

answered May 4, 2009 at 22:26

CookieOfFortune


2 Comments

C. K. Young Over a year ago

You may want to explain it in terms of static single assignment (SSA) form, which then removes the need to think about what value "hash" has at any given point in time. :-)

Adnan Over a year ago

Looks like the original algorithm says it should be: 31^3*value[0] + 31^2*value[1] + 31^1*value[2] + ... Or is it just my fried brain misfiring?

Laurence Gonsalves · Accepted Answer · 2009-05-04 22:45:36Z

13

I'm not sure if you missed where it says "^ indicates exponentiation" (not xor) in that documentation.

Each time through the loop, the previous value of hash is multiplied by 31 again before being added to the next element of value.

One could prove these things are equal by induction, but I think an example might be clearer:

Say we're dealing with a 4-char string. Let's unroll the loop:

hash = 0;
hash = 31 * hash + value[0];
hash = 31 * hash + value[1];
hash = 31 * hash + value[2];
hash = 31 * hash + value[3];

Now combine these into one statement by substituting each value of hash into the following statement:

hash = 31 * (31 * (31 * (31 * 0 + value[0]) + value[1]) + value[2])
     + value[3];

31 * 0 is 0, so simplify:

hash = 31 * (31 * (31 * value[0] + value[1]) + value[2])
     + value[3];

Now multiply the two inner terms by that second 31:

hash = 31 * (31 * 31 * value[0] + 31 * value[1] + value[2])
     + value[3];

Now multiply the three inner terms by that first 31:

hash = 31 * 31 * 31 * value[0] + 31 * 31 * value[1] + 31 * value[2]
     + value[3];

and convert to exponents (not really Java anymore):

hash = 31^3 * value[0] + 31^2 * value[1] + 31^1 * value[2] + value[3];
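
As a concrete check of the 4-char case, here is a small sketch (the class name FourCharExample and its method names are hypothetical) that evaluates the fully nested statement and the fully expanded one for the string "abcd", where 'a' = 97, 'b' = 98, 'c' = 99 and 'd' = 100:

```java
class FourCharExample {
    // The combined statement, before any simplification.
    static int nested(char[] value) {
        return 31 * (31 * (31 * (31 * 0 + value[0]) + value[1]) + value[2])
             + value[3];
    }

    // The fully distributed statement, with the powers written out as products.
    static int expanded(char[] value) {
        return 31 * 31 * 31 * value[0] + 31 * 31 * value[1] + 31 * value[2]
             + value[3];
    }

    public static void main(String[] args) {
        char[] value = "abcd".toCharArray();
        // 29791*97 + 961*98 + 31*99 + 100 = 2987074
        System.out.println(nested(value));      // prints 2987074
        System.out.println(expanded(value));    // prints 2987074
        System.out.println("abcd".hashCode());  // prints 2987074
    }
}
```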

answered May 4, 2009 at 22:45

Laurence Gonsalves


4 Comments

David Citron Over a year ago

RE your first sentence: Did you see some evidence that the question or a particular answer was assuming xor?

Laurence Gonsalves Over a year ago

You'd expressed confusion about how the code and the documentation could be equivalent. Since the documentation was using "^" for exponentiation, but Java normally uses it to mean bitwise xor, I wondered if that was the source of your confusion. (There were no other answers when I started writing my answer, BTW)

David Citron Over a year ago

Ahh, I see. No, I was aware that it was exponentiation, but unclear on how the implementation followed from the mathematical expression. Your answer clarifies that greatly--but knowing to write that code given only that expression is still a leap for me. To arrive at that code, it would seem that you'd have to write out a small example, realize that you can "multiply by 0 in a clever way" in the innermost nesting to complete the pattern, then form the loop.

Laurence Gonsalves Over a year ago

It would not surprise me at all if the code actually came first, and the documentation was written afterwards.

Devin Jeanpierre · Accepted Answer · 2009-05-04 22:44:54Z

10

Proof by induction:

T1(s) = 0 if |s| == 0, else s[|s|-1] + 31*T1(s[0:|s|-1])
T2(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
P(n) = for all strings s s.t. |s| = n, T1(s) = T2(s)

Let s be an arbitrary string, and n=|s|
Base case: n = 0
    0 (additive identity, T2(s)) = 0 (T1(s))
    P(0)
Suppose n > 0
    T1(s) = s[n-1] + 31*T1(s[0:n-1])
    T2(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] = s[n-1] + 31*(s[0]*31^(n-2) + s[1]*31^(n-3) + ... + s[n-2]) = s[n-1] + 31*T2(s[0:n-1])
    By the induction hypothesis P(n-1), T1(s[0:n-1]) = T2(s[0:n-1]), so
        T1(s) = s[n-1] + 31*T1(s[0:n-1]) = s[n-1] + 31*T2(s[0:n-1]) = T2(s)
    P(n)

I think I have it, and a proof was requested.
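
The two ingredients of the proof can be spot-checked in code. This is only a sketch under assumptions: the names InductionCheck, t1 and stepHolds are invented here, and String.hashCode() stands in for T2 because its documentation defines it by that formula. A finite check is of course not a proof; it only illustrates the recursive definition of T1 and the induction step T2(s) = s[n-1] + 31*T2(s[0:n-1]).

```java
class InductionCheck {
    // T1 from the proof: 0 for the empty string, else s[n-1] + 31*T1(s[0:n-1]).
    static int t1(String s) {
        int n = s.length();
        if (n == 0) {
            return 0;
        }
        return s.charAt(n - 1) + 31 * t1(s.substring(0, n - 1));
    }

    // Induction step, taking String.hashCode() as T2 (an assumption based on its documentation).
    static boolean stepHolds(String s) {
        int n = s.length();
        if (n == 0) {
            return s.hashCode() == 0; // base case: empty sum
        }
        return s.hashCode() == s.charAt(n - 1) + 31 * s.substring(0, n - 1).hashCode();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "x", "proof", "by induction"}) {
            System.out.println("\"" + s + "\": T1 == T2 is " + (t1(s) == s.hashCode())
                    + ", step holds is " + stepHolds(s));
        }
    }
}
```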

edited May 4, 2009 at 22:44

answered May 4, 2009 at 22:31

Devin Jeanpierre


Comments

Bobby Eickhoff · Accepted Answer · 2009-05-04 22:31:46Z

9

Take a look at the first few iterations and you'll see the pattern start to emerge:

hash₀ = 0 + s₀ = s₀
hash₁ = 31(hash₀) + s₁ = 31(s₀) + s₁
hash₂ = 31(hash₁) + s₂ = 31(31(s₀) + s₁) + s₂ = 31²(s₀) + 31(s₁) + s₂
...
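
The same pattern can be printed for a real string. A minimal sketch (PartialHashes and partialHashes are hypothetical names) that records the value of hash after each iteration, so that entry i corresponds to hashᵢ above:

```java
import java.util.Arrays;

class PartialHashes {
    // Returns the value of hash after each loop iteration.
    static int[] partialHashes(String s) {
        int[] partial = new int[s.length()];
        int hash = 0;
        for (int i = 0; i < s.length(); i++) {
            hash = 31 * hash + s.charAt(i);
            partial[i] = hash;
        }
        return partial;
    }

    public static void main(String[] args) {
        // 'a' = 97, 'b' = 98, 'c' = 99:
        // 97, then 31*97 + 98 = 3105, then 31*3105 + 99 = 96354
        System.out.println(Arrays.toString(partialHashes("abc"))); // prints [97, 3105, 96354]
    }
}
```

Each entry is also the hash code of the corresponding prefix, so the last one equals "abc".hashCode().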

answered May 4, 2009 at 22:31

Bobby Eickhoff


3 Comments

C. K. Young Over a year ago

<3 Thanks for (more or less) writing out CookieOfFortune's answer in SSA form. Much appreciated!

Nikhil Over a year ago

Would be even better if you could vertically align all the corresponding terms, and distribute the 31(...) in the third line.

Nikhil Over a year ago

@CookieOfFortune: There's an HTML tag for it. Look at the page source. I'd have used Unicode, though.

Community · Accepted Answer · 2017-05-23 12:30:25Z

0

Isn't it wasteful to compute the hash code of a String from all of its characters? Imagine file names or class names with their full paths put into a HashSet. Or someone who uses HashSets of String documents instead of Lists because "HashSet always beats Lists".

I would do something like:

int off = offset;
char val[] = value;
int len = count;
int h = 0;  // running hash

// for long strings, visit only every step-th character
int step = len <= 10 ? 1 : len / 10;

for (int i = 0; i < len; i += step) {
    h = 31*h + val[off+i];
}
hash = h;

In the end, a hash code is nothing more than a hint.

edited May 23, 2017 at 12:30

CommunityBot


answered Jul 21, 2013 at 14:54

David


3 Comments

supercat Over a year ago

Ignoring half the characters in the string would mean that storing a sequence of "counting strings" into a hash table could easily cause 100 strings to map to each hash value. Ignoring more than half the characters would make things even worse. Ignoring any aspect of the string for hashing purposes risks a really huge penalty in exchange for a pretty small payoff.

David Ongaro Over a year ago

That's essentially what the early designers of Java thought. Initially, the string hash function took only a sample of characters when the string was longer than 15 characters. Eventually it had to be fixed because it turned out to yield very bad hash performance with certain strings (e.g. with sets of URLs, which often look similar): bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. The performance gains from not using the whole string cannot offset the much worse hash performance.

David Ongaro Over a year ago

To clarify: the second kind of performance refers to "hash table" performance, not to the raw speed of computing the hash.
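
The collision risk described in these comments can be illustrated with a small experiment. This is a sketch under assumptions: sampledHash re-implements the sampling idea from the answer as a standalone method, the class name SampledHashDemo is made up, and the example.com URLs are invented test data. It counts how many distinct hash values each approach produces for 100 similar strings that differ only in a two-digit suffix.

```java
import java.util.HashSet;
import java.util.Set;

class SampledHashDemo {
    // Sampling hash as proposed in the answer: visits only every step-th character.
    static int sampledHash(String s) {
        int len = s.length();
        int step = len <= 10 ? 1 : len / 10;
        int h = 0;
        for (int i = 0; i < len; i += step) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Returns {distinct sampled hashes, distinct full hashes} over 100 similar strings.
    static int[] distinctCounts() {
        Set<Integer> sampled = new HashSet<>();
        Set<Integer> full = new HashSet<>();
        for (int i = 0; i < 100; i++) {
            // Made-up URLs that differ only in their last two characters.
            String url = "http://example.com/docs/chapter/page-" + String.format("%02d", i);
            sampled.add(sampledHash(url));
            full.add(url.hashCode());
        }
        return new int[] {sampled.size(), full.size()};
    }

    public static void main(String[] args) {
        int[] counts = distinctCounts();
        System.out.println("distinct sampled hashes: " + counts[0]);
        System.out.println("distinct full hashes:    " + counts[1]);
    }
}
```

Whenever the sampling step skips over the characters that actually differ, many strings share one hash value and a hash table degrades toward a linear search within that bucket.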
