It is common in data science to receive two equal-length vectors (1-dimensional arrays), say Categories and Weights.
We aim to find all unique values of Categories and sum up the corresponding Weights. E.g.
Categories = ["abc", "def", "a", "a", "def"]
Weights = [ 1 , 2 , 1 , 10 , 1000]
Let's call our algorithm/function groupsum; then calling it
groupsum(Categories, Weights)
should give a result like
("abc" => 1, "def" => 1002, "a" => 11)
One algorithm is to loop through Categories and build up a hash table where the keys are the unique values of Categories and the values are the running sums of the corresponding Weights. This is the accumulator pattern.
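For reference, here is a minimal sketch of that accumulator approach in Julia; the name groupsum and the Dict-based implementation are just one way to write it, not a tuned solution:

```julia
# Accumulate weights per category in a Dict; a single pass over the data.
function groupsum(categories::AbstractVector, weights::AbstractVector)
    acc = Dict{eltype(categories), eltype(weights)}()
    for (c, w) in zip(categories, weights)
        acc[c] = get(acc, c, zero(eltype(weights))) + w
    end
    return acc
end

groupsum(["abc", "def", "a", "a", "def"], [1, 2, 1, 10, 1000])
# Dict("abc" => 1, "def" => 1002, "a" => 11)
```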
I wonder if there are even cleverer ways that deal with large amounts of data, e.g. vectors that are ~2 billion+ elements long.
Or are there specialised algorithms for particular Categories element types? E.g. if Categories is a UInt8 vector, we can skip the hashmap and use an array indexed by the category value to hold the accumulators (see the sketch below).
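A rough sketch of that UInt8 special case, assuming the usual trick of indexing a dense 256-element array by the category value (the function name groupsum_uint8 is made up for illustration):

```julia
# For UInt8 categories there are only 256 possible values, so a dense
# array can replace the hash table entirely.
function groupsum_uint8(categories::AbstractVector{UInt8}, weights::AbstractVector)
    acc  = zeros(eltype(weights), 256)   # one accumulator per possible UInt8 value
    seen = falses(256)                   # track which categories actually occur
    @inbounds for i in eachindex(categories, weights)
        k = Int(categories[i]) + 1       # +1 because Julia arrays are 1-based
        acc[k] += weights[i]
        seen[k] = true
    end
    # Report only the categories that actually appeared in the input
    return Dict(UInt8(k - 1) => acc[k] for k in 1:256 if seen[k])
end
```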
Any links to research welcome!