Grouping unique integer arrays

Question

One module in my app generates a small array of integers. Typically the size is 25 integers. The integers tend to be pretty small, less than 10000. I'll like to save all the unique arrays in a container of some sort. The number of arrays generated can be in the millions.

So, for every new array I need to figure out if it already exits. And if it does what's the index.

A naive approach is to keep all arrays in a list and then just call:

MyList.FindIndex(x=>x.SequenceEqual(Small_Array));

But this becomes very slow if the number of arrays is getting into the thousands.

A less naive approach is to store all arrays in a dictionary where the key is a hash value from the array. If the hash is just another integer (32bit) than I have cannot find a good hashing algorithm which doesn't collides.

Which, I think leaves me to using a hashing algorithm like MD5 that can be converted into a 128bit integer. Is that a good way to tackle my problem?

Why does it matter if the hashing algorithm collides or not? Why not just use the hash as a basis for deciding if you should check the sequence? — ProgrammingLlama
– ProgrammingLlama, Commented Jun 6, 2018 at 15:21
Hashcodes are allowed to collide - it's just a way to filter out obvious not equals. — Dan Rayson
– Dan Rayson, Commented Jun 6, 2018 at 15:21
@DanRayson: They're allowed to collide when computed for a key - but the OP was suggesting using the hash itself as the key, and that wouldn't work. — Jon Skeet
– Jon Skeet, Commented Jun 6, 2018 at 15:24
take a look at stackoverflow.com/questions/670063/… or stackoverflow.com/questions/8094867/… Depending on whether you care about order — Yair Halberstadt
– Yair Halberstadt, Commented Jun 6, 2018 at 15:25
@YairHalberstadt: I don't think the OP wants an order-agnostic comparison. — Jon Skeet
– Jon Skeet, Commented Jun 6, 2018 at 15:27

Jon Skeet · Accepted Answer · 2018-06-06 15:38:18Z

4

Rather than make the key the hash, make it the array itself - with a custom comparer. The value would be the notional "index".

The comparer doesn't need to be hugely efficient, nor does the hash generation need to go to great length to avoid duplicates, so long as there aren't too many collisions. (You should potentially add logging to check that.) Here's a really simple start:

public class Int32ArrayEqualityComparer : IEqualityComparer<int[]>
{
    // Note: SequenceEqual already checks the count before looking at content.
    public bool Equals(int[] first, int[] second) =>
        first.SequenceEqual(second);

    public int GetHashCode(int[] array)
    {
        unchecked
        {
            int hash = 23;
            foreach (var item in array)
            {
                hash = hash * 31 + item;
            }
            return hash;
        }
    }
}

You'd then create the dictionary like this:

var arrayMap = new Dictionary<int[], int>(new Int32ArrayEqualityComparer());

Then you'd have something like:

public int MaybeAddArray(int[] array)
{
    if (!arrayMap.TryGetValue(array, out var index))
    {
        index = arrayMap.Count + 1;
        arrayMap[array] = index;
    }
    return index;
}

Note that ConcurrentDictionary has simpler ways of doing this. Also note that the "index" is somewhat artificial here. You may not even need this, depending on what you're doing.

edited Jun 6, 2018 at 15:38

answered Jun 6, 2018 at 15:24

Jon Skeet

1.5m893 gold badges9.3k silver badges9.3k bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Dan Rayson Over a year ago

I'm wondering if there's a mathematical way to know the correct number of elements from the array to use when generating the hashcode. Less means faster but more likely to need to check entire collection, More means slower but less likely to require full checking. Balancing act?

Jon Skeet Over a year ago

@DanRayson: It's only going to be once per array, so I suspect it's not much of an issue. The OP can always tweak that if necessary.

cwharris Over a year ago

@DanRayson That's depend on the way the elements are generated. One way is to logically identify the portions of the arrays to most likely collide, and try to ignore those portions when generating the hash. An example might be to ignore Mand N in xxxxxxxx-xxxx-Mxxx-Nxxx-xxxxxxxxxxxx for GUIDs (a silly example since GUIDs are already a compact unique identifier.

cwharris Over a year ago

When I say ignore, I mean compare them last - compare the items in the order of how likely they are to differ in value.

Jon Skeet Over a year ago

@cwharris: That can only be done in Equals - not in GetHashCode. I've just thought of a minor optimization though...

|

Collectives™ on Stack Overflow

Grouping unique integer arrays

1 Answer 1

6 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

6 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related