It is common in data science to receive two equal-length vectors (1-dimensional arrays), say Categories and Weights.
We aim to find all unique values of Categories and sum up the corresponding Weights. E.g.
Categories = ["abc", "def", "a", "a", "def"]
Weights = [ 1 , 2 , 1 , 10 , 1000]
Let's call our algorithm/function groupsum; then calling it
groupsum(Categories, Weights)
should give a result like
("abc" => 1, "def" => 1002, "a" => 11)
One algorithm is to loop through Categories and build up a hash table where the keys are the unique values of Categories and the values are the running sums of the corresponding Weights. This is the accumulator pattern.
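For reference, here is a minimal sketch of that accumulator approach in Julia; the name groupsum and the Dict-based implementation are just one way to write it, not a tuned solution:

```julia
# Accumulate weights per category in a Dict; a single pass over the data.
function groupsum(categories::AbstractVector, weights::AbstractVector)
    acc = Dict{eltype(categories), eltype(weights)}()
    for (c, w) in zip(categories, weights)
        acc[c] = get(acc, c, zero(eltype(weights))) + w
    end
    return acc
end

groupsum(["abc", "def", "a", "a", "def"], [1, 2, 1, 10, 1000])
# Dict("abc" => 1, "def" => 1002, "a" => 11)
```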
I wonder if there are even cleverer ways that deal with large amounts of data, e.g. vectors that are ~2 billion+ elements long.
Or are there specialised algorithms for particular Categories element types? E.g. if Categories is a UInt8 vector, we can skip the hashmap and use an array indexed by the category value to hold the accumulators (see the sketch below).
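A rough sketch of that UInt8 special case, assuming the usual trick of indexing a dense 256-element array by the category value (the function name groupsum_uint8 is made up for illustration):

```julia
# For UInt8 categories there are only 256 possible values, so a dense
# array can replace the hash table entirely.
function groupsum_uint8(categories::AbstractVector{UInt8}, weights::AbstractVector)
    acc  = zeros(eltype(weights), 256)   # one accumulator per possible UInt8 value
    seen = falses(256)                   # track which categories actually occur
    @inbounds for i in eachindex(categories, weights)
        k = Int(categories[i]) + 1       # +1 because Julia arrays are 1-based
        acc[k] += weights[i]
        seen[k] = true
    end
    # Report only the categories that actually appeared in the input
    return Dict(UInt8(k - 1) => acc[k] for k in 1:256 if seen[k])
end
```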
Any links to research welcome!