2

One module in my app generates a small array of integers. Typically the size is 25 integers. The integers tend to be pretty small, less than 10000. I'll like to save all the unique arrays in a container of some sort. The number of arrays generated can be in the millions.

So, for every new array I need to figure out if it already exits. And if it does what's the index.

A naive approach is to keep all arrays in a list and then just call:

MyList.FindIndex(x=>x.SequenceEqual(Small_Array));

But this becomes very slow if the number of arrays is getting into the thousands.

A less naive approach is to store all arrays in a dictionary where the key is a hash value from the array. If the hash is just another integer (32bit) than I have cannot find a good hashing algorithm which doesn't collides.

Which, I think leaves me to using a hashing algorithm like MD5 that can be converted into a 128bit integer. Is that a good way to tackle my problem?

5
  • Why does it matter if the hashing algorithm collides or not? Why not just use the hash as a basis for deciding if you should check the sequence? Commented Jun 6, 2018 at 15:21
  • 1
    Hashcodes are allowed to collide - it's just a way to filter out obvious not equals. Commented Jun 6, 2018 at 15:21
  • @DanRayson: They're allowed to collide when computed for a key - but the OP was suggesting using the hash itself as the key, and that wouldn't work. Commented Jun 6, 2018 at 15:24
  • take a look at stackoverflow.com/questions/670063/… or stackoverflow.com/questions/8094867/… Depending on whether you care about order Commented Jun 6, 2018 at 15:25
  • @YairHalberstadt: I don't think the OP wants an order-agnostic comparison. Commented Jun 6, 2018 at 15:27

1 Answer 1

4

Rather than make the key the hash, make it the array itself - with a custom comparer. The value would be the notional "index".

The comparer doesn't need to be hugely efficient, nor does the hash generation need to go to great length to avoid duplicates, so long as there aren't too many collisions. (You should potentially add logging to check that.) Here's a really simple start:

public class Int32ArrayEqualityComparer : IEqualityComparer<int[]>
{
    // Note: SequenceEqual already checks the count before looking at content.
    public bool Equals(int[] first, int[] second) =>
        first.SequenceEqual(second);

    public int GetHashCode(int[] array)
    {
        unchecked
        {
            int hash = 23;
            foreach (var item in array)
            {
                hash = hash * 31 + item;
            }
            return hash;
        }
    }
}

You'd then create the dictionary like this:

var arrayMap = new Dictionary<int[], int>(new Int32ArrayEqualityComparer());

Then you'd have something like:

public int MaybeAddArray(int[] array)
{
    if (!arrayMap.TryGetValue(array, out var index))
    {
        index = arrayMap.Count + 1;
        arrayMap[array] = index;
    }
    return index;
}

Note that ConcurrentDictionary has simpler ways of doing this. Also note that the "index" is somewhat artificial here. You may not even need this, depending on what you're doing.

Sign up to request clarification or add additional context in comments.

6 Comments

I'm wondering if there's a mathematical way to know the correct number of elements from the array to use when generating the hashcode. Less means faster but more likely to need to check entire collection, More means slower but less likely to require full checking. Balancing act?
@DanRayson: It's only going to be once per array, so I suspect it's not much of an issue. The OP can always tweak that if necessary.
@DanRayson That's depend on the way the elements are generated. One way is to logically identify the portions of the arrays to most likely collide, and try to ignore those portions when generating the hash. An example might be to ignore Mand N in xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx for GUIDs (a silly example since GUIDs are already a compact unique identifier.
When I say ignore, I mean compare them last - compare the items in the order of how likely they are to differ in value.
@cwharris: That can only be done in Equals - not in GetHashCode. I've just thought of a minor optimization though...
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.