I have the following two arrays of the same length, one of tag categories and one of tags. I want to group the tags by category and count the occurrences of each tag.

As you can see, different tags can share the same category (e.g. 'world' and 'hello').

I know this can easily be done with loops, but I'm sure NumPy has some nifty ways of doing it more efficiently. Any help would be greatly appreciated.

# Tag category
A = [10, 10, 20, 10, 10, 10, 20, 10, 20, 20]
# Tags
B = ['hello', 'world', 'how', 'are', 'you', 'world', 'you', 'how', 'hello', 'hello']

Expected result:

[(10, (('hello', 1), ('are', 1), ('you', 1), ('world', 2))), (20, (('how', 1), ('you', 1), ('hello', 2)))]
  • Pandas may be more suitable for this. Commented Nov 7, 2014 at 15:16

5 Answers

You can use nested collections.defaultdict for this.

Here we use the integers from A as the keys of the outer dict; each inner dict uses the words from B as keys and their counts as values.

>>> from collections import defaultdict
>>> from pprint import pprint
>>> d = defaultdict(lambda: defaultdict(int))
>>> for k, v in zip(A, B):
        d[k][v] += 1

Now d contains the following (converted to a normal dict, because its output is less confusing):

>>> pprint({k: dict(v) for k, v in d.items()})
{10: {'are': 1, 'hello': 1, 'how': 1, 'world': 2, 'you': 1},
 20: {'hello': 2, 'how': 1, 'you': 1}}

Now we loop through the outer dict and call tuple(v.iteritems()) on each inner dict to get the desired output (on Python 3, use v.items() instead of v.iteritems()):

>>> pprint([(k, tuple(v.iteritems())) for k, v in d.items()])
[(10, (('world', 2), ('you', 1), ('hello', 1), ('how', 1), ('are', 1))),
 (20, (('how', 1), ('you', 1), ('hello', 2)))]
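
For readers on Python 3, the same steps can be sketched as a plain script. This is only an illustration, not the answer's original session: the variable name `result` is introduced here, and `items()` stands in for `iteritems()`. Because dicts keep insertion order on Python 3.7+, tags come out in first-seen order rather than the order shown above.

```python
from collections import defaultdict

# Tag categories and tags from the question
A = [10, 10, 20, 10, 10, 10, 20, 10, 20, 20]
B = ['hello', 'world', 'how', 'are', 'you', 'world', 'you', 'how', 'hello', 'hello']

# Outer dict: category -> inner dict; inner dict: tag -> count (default 0)
d = defaultdict(lambda: defaultdict(int))
for category, tag in zip(A, B):
    d[category][tag] += 1

# Convert each inner dict to a tuple of (tag, count) pairs
result = [(category, tuple(tags.items())) for category, tags in d.items()]
print(result)
```

Note that category 10 also contains 'how' (index 7 in the input), so it appears in the result even though the question's expected output omits it.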
5 Comments

I'm not following this part: defaultdict(lambda: defaultdict(int)). Can you explain in more detail? Thanks!
@marcin_koss This creates a nested dictionary structure, where each outer key maps to an inner dictionary whose values are integers (default 0).
Got it, thanks. Now, what would be the best way to also order tag tuples by count?
@marcin_koss You can replace v.iteritems() with sorted(v.iteritems(), key=itemgetter(1)), where itemgetter is operator.itemgetter.
Perfect, thank you! I will look into Pandas as well.
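
The sorting suggestion from the comments might look like the sketch below on Python 3. The `counts` dict is copied from the pprint output in the answer. `reverse=True` is an assumption added here (most frequent tag first); drop it for ascending order as in the original comment. The name `by_count` is hypothetical.

```python
from operator import itemgetter

# Per-category tag counts, as printed in the answer
counts = {
    10: {'are': 1, 'hello': 1, 'how': 1, 'world': 2, 'you': 1},
    20: {'hello': 2, 'how': 1, 'you': 1},
}

# Sort each category's (tag, count) pairs by count; the sort is stable,
# so tags with equal counts keep their original relative order
by_count = [
    (category, tuple(sorted(tags.items(), key=itemgetter(1), reverse=True)))
    for category, tags in counts.items()
]
print(by_count)
```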
Since it's been mentioned, here's a way to aggregate the values with Pandas.

Setting up a DataFrame...

>>> import pandas as pd
>>> df = pd.DataFrame({'A': A, 'B': B})
>>> df
    A      B
0  10  hello
1  10  world
2  20    how
3  10    are
4  10    you
5  10  world
6  20    you
7  10    how
8  20  hello
9  20  hello

Pivoting to aggregate values...

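>>> # older pandas spelled these keywords rows= and cols=; current versions use index= and columns=
>>> table = pd.pivot_table(df, index='B', columns='A', aggfunc='size')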
>>> table
A      10  20
B            
are     1 NaN
hello   1   2
how     1   1
world   2 NaN
you     1   1

Converting back to a dictionary...

>>> table.to_dict()
{10: {'are': 1.0, 'hello': 1.0, 'how': 1.0, 'world': 2.0, 'you': 1.0},
 20: {'are': nan, 'hello': 2.0, 'how': 1.0, 'world': nan, 'you': 1.0}}

From here you could use Python to adjust the dictionary to a desired format (e.g. a list).
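
One possible way to do that adjustment, sketched with the standard library only. The `table_dict` literal below is copied from the `to_dict()` output above so the snippet runs without pandas; the names `table_dict` and `result` are illustrative. The sketch drops NaN entries (tags absent from a category) and casts the float counts back to int.

```python
import math

# Output of table.to_dict() as shown above, with NaN for missing tags
table_dict = {
    10: {'are': 1.0, 'hello': 1.0, 'how': 1.0, 'world': 2.0, 'you': 1.0},
    20: {'are': float('nan'), 'hello': 2.0, 'how': 1.0,
         'world': float('nan'), 'you': 1.0},
}

# Keep only tags that actually occur in the category, as integer counts
result = [
    (category,
     tuple((tag, int(n)) for tag, n in tags.items() if not math.isnan(n)))
    for category, tags in table_dict.items()
]
print(result)
```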

Here is one way:

>>> from collections import Counter
>>> import numpy as np
>>> A = np.array([10, 10, 20, 10, 10, 10, 20, 10, 20, 20])
>>> B = np.array(['hello', 'world', 'how', 'are', 'you', 'world', 'you', 'how', 'hello','hello'])
>>> [(i,Counter(B[np.where(A==i)]).items()) for i in set(A)]
[(10, [('world', 2), ('you', 1), ('hello', 1), ('how', 1), ('are', 1)]), (20, [('how', 1), ('you', 1), ('hello', 2)])]

1 Comment

This won't scale well as you're doing this in quadratic time.
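
The list comprehension in this answer rescans A once per distinct category. A single pass over the data avoids that; the following is a minimal sketch of one alternative (not the answerer's code), counting (category, tag) pairs with `Counter` and then regrouping. The names `pair_counts`, `grouped`, and `result` are illustrative.

```python
from collections import Counter

A = [10, 10, 20, 10, 10, 10, 20, 10, 20, 20]
B = ['hello', 'world', 'how', 'are', 'you', 'world', 'you', 'how', 'hello', 'hello']

# One pass: count each (category, tag) pair
pair_counts = Counter(zip(A, B))

# Regroup the pair counts by category
grouped = {}
for (category, tag), n in pair_counts.items():
    grouped.setdefault(category, []).append((tag, n))

result = [(category, tuple(tags)) for category, tags in grouped.items()]
print(result)
```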
but I'm sure numpy has some nifty ways of doing it more efficiently

And you're right! Here is the code:

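import numpy

# encode categories and tags as integer indices into the lookup arrays
category_lookup, categories = numpy.unique(A, return_inverse=True)
tag_lookup, tags = numpy.unique(B, return_inverse=True)

# statistics[i, j] counts how often tag j occurs in category i
statistics = numpy.zeros([len(category_lookup), len(tag_lookup)])
# pass the indices as a tuple so each pair addresses one cell
numpy.add.at(statistics, (categories, tags), 1)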

result = {}
for category, stat in zip(category_lookup, statistics):
    result[category] = dict(zip(tag_lookup[stat != 0], stat[stat != 0]))

For an explanation, see numpy tips and tricks. This gives the expected answer:

{10: {'are': 1.0, 'hello': 1.0, 'how': 1.0, 'world': 2.0, 'you': 1.0}, 20: {'hello': 2.0, 'how': 1.0, 'you': 1.0}}

I admit this is not the clearest way to do it (see the pandas solution), but it is really fast when you have a huge amount of data.
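
A variant of the same idea that yields integer counts can be sketched with `numpy.bincount` in place of `numpy.add.at`. This swaps in a different technique from the one in the answer: each (category, tag) index pair is flattened into a single integer and the integers are counted. The names `flat`, `counts`, and `n_tags` are introduced here for illustration.

```python
import numpy as np

A = np.array([10, 10, 20, 10, 10, 10, 20, 10, 20, 20])
B = np.array(['hello', 'world', 'how', 'are', 'you',
              'world', 'you', 'how', 'hello', 'hello'])

# Encode categories and tags as integer indices into sorted lookup arrays
category_lookup, categories = np.unique(A, return_inverse=True)
tag_lookup, tags = np.unique(B, return_inverse=True)

# Flatten each (category, tag) pair into one index and count occurrences
n_tags = len(tag_lookup)
flat = categories * n_tags + tags
counts = np.bincount(flat, minlength=len(category_lookup) * n_tags)
counts = counts.reshape(len(category_lookup), n_tags)

# Decode back to plain Python types, skipping tags with zero count
result = {
    int(category): {str(tag): int(n) for tag, n in zip(tag_lookup, row) if n}
    for category, row in zip(category_lookup, counts)
}
print(result)
```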

NumPy makes counting occurrences easy:

import numpy as np

arr = np.array([0,1,2,2,3,3,7,3,4,0,4,4,0,4,5,0,5,9,5,9,5,8,5])
print(np.sum(arr == 4))  # count occurrences of the number 4

# unique values and how often each one occurs
unique, counts = np.unique(arr, return_counts=True)
print(unique, counts)

Output:

4
[0 1 2 3 4 5 7 8 9] [4 1 2 3 4 5 1 1 2]
