
The implementation of Nullable<T>.GetHashCode() is as follows:

public override int GetHashCode()
{
    if (!this.HasValue)
    {
        return 0;
    }
    return this.value.GetHashCode();
}

If, however, the underlying value also generates a hash code of 0 (e.g. a bool set to false or an Int32 set to 0), then we have two commonly occurring, different object states with the same hash code. It seems to me that a better implementation would have been something like:

public override int GetHashCode()
{
    if (!this.HasValue)
    {
        return unchecked((int)0xD523648A); // E.g. some arbitrary 32-bit value with a good
                                           // mix of set and unset bits (the unchecked cast
                                           // is needed because the literal exceeds int.MaxValue).
    }
    return this.value.GetHashCode();
}
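
For reference, here is a quick sketch (the HashDemo harness is invented for illustration) confirming the collision described above: an int hashes to itself, so both the empty Nullable&lt;T&gt; and a zero value return 0.

using System;

class HashDemo
{
    static void Main()
    {
        int? noValue = null;
        int? zero = 0;

        // int.GetHashCode() returns the value itself, so 0 hashes to 0...
        Console.WriteLine(zero.GetHashCode());    // 0
        // ...which is exactly what the empty Nullable<T> returns as well:
        Console.WriteLine(noValue.GetHashCode()); // 0

        // Same collision for bool: false and "no value" both hash to 0.
        Console.WriteLine(((bool?)false).GetHashCode()); // 0
        Console.WriteLine(default(bool?).GetHashCode()); // 0
    }
}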

  • What makes you think that the underlying GetHashCode method is more likely to produce 0 than 0x12345678? Commented Nov 23, 2012 at 11:24
  • @stakx Because, for instance and just for instance, x.GetHashCode() == x for any int x. 0 is much more likely than a random constant for any given integer in real applications. Commented Nov 23, 2012 at 11:27
  • Is this question the result of an actual issue you ran into, or is it purely theoretical? In any situation where this has even the slightest performance implications on actual software, you should be looking at not using Nullable<T> at all. Commented Nov 23, 2012 at 11:29
  • Not unique, but it should produce as few collisions as possible. Commented Nov 23, 2012 at 11:30
  • @slawekwin: Yes, I happened to encounter 0xD523648A already twice today ;) Commented Nov 23, 2012 at 11:40

4 Answers


Yes, you do have a point. It is always possible to write a better GetHashCode() implementation if you know up front what data you are going to store. Not a luxury that a library writer ever has available. But yes, if you have a lot of bool? values that are either false or !HasValue, then the default implementation is going to hurt. The same goes for enums and ints; zero is a common value.
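
To make "going to hurt" concrete, here is a small sketch (the Flags type is invented for illustration) where a composite key combines two bool? fields in the usual way; every mix of false and null collapses to the same hash code:

using System;

// Invented composite key: combines two nullable flags the usual way.
struct Flags
{
    public bool? A;
    public bool? B;

    public override int GetHashCode()
    {
        // false.GetHashCode() == 0 and an empty bool? also returns 0, so
        // "false" and "not set" are indistinguishable to the hash.
        return A.GetHashCode() * 31 + B.GetHashCode();
    }
}

class FlagsDemo
{
    static void Main()
    {
        Console.WriteLine(new Flags { A = false, B = false }.GetHashCode()); // 0
        Console.WriteLine(new Flags { A = false, B = null }.GetHashCode());  // 0
        Console.WriteLine(new Flags { A = null,  B = false }.GetHashCode()); // 0
        Console.WriteLine(new Flags { A = null,  B = null }.GetHashCode());  // 0
    }
}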

Your argument is academic, however: changing the implementation costs minus ten thousand points, and you can't do it yourself. The best you can do is submit the suggestion; the proper channel is the UserVoice site. Getting traction on this is going to be difficult; good luck.


3 Comments

That sounds to me like: if you know your domain, you can most likely do better. If you want nullable structs and know you can do better than Nullable<T>, build it. You won't have compiler support for it, but hey, you're in a pretty specific situation already.
Nullable is a tough case; it is directly supported by the CLR to handle its unusual boxing behavior. Not that easy to replace.
Ah, interesting, I didn't know that; food for investigation :) It still leaves open the possibility of rolling your own, though (see the sketch below). The performance issue was GetHashCode(), not boxing/unboxing.
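
A minimal sketch of such a hand-rolled wrapper ("Maybe" is an invented name; as the comments note, you lose the compiler's T? syntax and the CLR's special boxing behavior, but the empty state gets a distinct hash code):

using System;
using System.Collections.Generic;

public readonly struct Maybe<T> where T : struct
{
    private readonly T value;

    public Maybe(T value)
    {
        this.value = value;
        HasValue = true;
    }

    public bool HasValue { get; }

    public T Value
    {
        get
        {
            if (!HasValue) throw new InvalidOperationException("Maybe<T> has no value.");
            return value;
        }
    }

    public override int GetHashCode()
    {
        // Reserve an arbitrary bit pattern for the empty state instead of 0.
        return HasValue
            ? EqualityComparer<T>.Default.GetHashCode(value)
            : unchecked((int)0xD523648A);
    }
}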

Let's first note that this question is just about performance. The hash code is not required to be unique or collision-resistant for correctness. It is helpful for performance, though.

Actually, this is the main value proposition of a hash table: Practically evenly distributed hash codes lead to O(1) behavior.

So what hash code constant is most likely to lead to the best possible performance profile in real applications?

Certainly not 0 because 0 is a common hash code: 0.GetHashCode() == 0. That goes for other types as well. 0 is the worst candidate because it tends to occur so often.

So how to avoid collisions? My proposal:

static readonly int nullableDefaultHashCode = GetRandomInt32(); // evaluated once per process

public override int GetHashCode()
{
    if (!this.HasValue)
        return nullableDefaultHashCode;
    else
        return this.value.GetHashCode();
}

Evenly distributed, unlikely to collide, and no stylistic problem of choosing an arbitrary constant.

Note that GetRandomInt32 could be implemented as return unchecked((int)0xD523648A);. It would still be more useful than return 0;. But it is probably best to query a cheap source of pseudo-random numbers.
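
For instance, a minimal sketch of that GetRandomInt32 (the helper body and the HashSeed container are assumptions; any cheap, well-mixed source of 32 bits would do):

using System;

static class HashSeed
{
    // Evaluated once per process by the type initializer.
    internal static readonly int NullableDefaultHashCode = GetRandomInt32();

    static int GetRandomInt32()
    {
        // A new GUID is a cheap source of a well-mixed 32-bit value;
        // a cryptographic RNG would work just as well here.
        return Guid.NewGuid().GetHashCode();
    }
}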

7 Comments

This just muddies the waters. If the performance of GetHashCode() is your issue, and you generate a default hash upon type initialization, then good luck tracking the performance problem across runs of your application. One run it's there; the next, poof, gone.
That's a good point. So set it in a constant, then. My main points still apply.
But when you set it in a constant, you're trying to solve a performance problem for every possible struct that any sentient being programming in C# might ever come up with.
@Willem: Zero is a constant.
My point is that non-zero numbers are likely, not guaranteed, to perform better. You are saying that there are cases where they don't. But those cases are clearly not as common. Why choose the worst possible number? Why choose the number that is maximally likely to collide?

In the end, a Nullable<T> without a value has to return a hash code, and that hash code should be a constant.

Returning an arbitrary constant may look safer or more appropriate, perhaps even more so when viewed within the specific case of Nullable<int>, but in the end it's just that: a hash.

And within the entire set that Nullable<T> can cover (which is infinite), zero is no better or worse a hash code than any other value.

4 Comments

You are excluding performance concerns from this answer. This question is just about performance.
@usr: And how would you implement generic performance for Nullable<T> over ALL possible structs in the entire universe?
It is not possible, but we can try to come close. Also see my answer, which goes further.
Any explanation for the downvotes? I'm not worried about them; I'd just like to hear the arguments. It's an interesting discussion :)

I don't understand the concern here - poor performance in what situation?

Why would you consider a hash function poor based on its result for one value?

I could see that it would be a problem if many different values of a Type hash to the same result. But the fact that null hashes to the same value as 0 seems insignificant.

As far as I know, the most common use of a .NET hash function is for a Hashtable, HashSet, or Dictionary key, and the fact that zero and null happen to land in the same bucket will have an insignificant effect on overall performance.
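
A small sketch of that point (the BucketDemo harness is invented for illustration): a set can hold at most one null, so the shared bucket carries at most one extra entry and lookups stay effectively O(1).

using System;
using System.Collections.Generic;

class BucketDemo
{
    static void Main()
    {
        // null and 0 produce the same hash code, so they share a bucket...
        Console.WriteLine(default(int?).GetHashCode()); // 0
        Console.WriteLine(((int?)0).GetHashCode());     // 0

        // ...but a set contains at most one null, so the collision costs
        // at most one extra equality check per lookup on that bucket.
        var set = new HashSet<int?> { null, 0, 1, 2 };
        Console.WriteLine(set.Contains(null)); // True
        Console.WriteLine(set.Contains(0));    // True
    }
}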

