Assuming I have a numpy array like [1,2,3,4,5,6] and another array [0,0,1,2,2,1], I want to sum the items in the first array by group (the second array) and obtain n per-group results in group-number order (in this case the result would be [3, 9, 9]). How do I do this in numpy?
Comments:

- Why do you need numpy for this? Aren't you just using vanilla Python lists? If not, what numpy type are you using? (Matt Ball, Dec 7, 2010)
- I need numpy for this because I don't want to loop through the array n times for n groups, since my array sizes can be arbitrarily large. I'm not using Python lists; I was just showing an example data set in brackets. The datatype is int. (Scribble Master, Dec 7, 2010)
- Related: stackoverflow.com/questions/7089379/… (TooTone, Apr 11, 2014)
11 Answers
The numpy function bincount was made exactly for this purpose and I'm sure it will be much faster than the other methods for all sizes of inputs:
import numpy as np

data = [1, 2, 3, 4, 5, 6]
ids = [0, 0, 1, 2, 2, 1]
np.bincount(ids, weights=data)  # returns [3., 9., 9.] as a float64 array
The i-th element of the output is the sum of all the data elements corresponding to "id" i.
Hope that helps.
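One caveat worth adding (my note, not part of the original answer): the output length is max(ids) + 1. If you need a fixed number of bins even when the trailing groups happen to be empty, bincount accepts a minlength argument:

import numpy as np

data = [1, 2, 3, 4, 5, 6]
ids = [0, 0, 1, 2, 2, 1]

# Guarantee at least 5 bins even though the largest id is 2;
# the missing groups come back as 0.0.
np.bincount(ids, weights=data, minlength=5)
# array([3., 9., 9., 0., 0.])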
Comments:

- Note that this requires ids to be non-negative integers and data to be one-dimensional.

This is a vectorized method of doing this sum based on the implementation of numpy.unique. According to my timings it is up to 500 times faster than the loop method and up to 100 times faster than the histogram method.
import numpy as np

def sum_by_group(values, groups):
    # values and groups must be numpy arrays; both are left unmodified.
    # Sort both arrays so that equal groups end up adjacent.
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]
    # Cumulative sum over the sorted values, computed in place on the copy.
    values.cumsum(out=values)
    # Boolean mask marking the last element of each group.
    index = np.ones(len(groups), 'bool')
    index[:-1] = groups[1:] != groups[:-1]
    values = values[index]
    groups = groups[index]
    # Difference the cumulative sums to recover each group's total.
    values[1:] = values[1:] - values[:-1]
    return values, groups
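For example, calling it with the arrays from the question (my usage sketch):

import numpy as np

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 1, 2, 2, 1])

sums, unique_groups = sum_by_group(values, groups)
# sums -> array([3, 9, 9]), unique_groups -> array([0, 1, 2])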
There's more than one way to do this, but here's one way:
import numpy as np

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])
unique_groups = np.unique(groups)

sums = []
for group in unique_groups:
    sums.append(data[groups == group].sum())
You can vectorize things so that there's no for loop at all, but I'd recommend against it. It becomes unreadable, and will require a couple of 2D temporary arrays, which could require large amounts of memory if you have a lot of data.
Edit: Here's one way you could entirely vectorize. Keep in mind that this may (and likely will) be slower than the version above. (And there may be a better way to vectorize this, but it's late and I'm tired, so this is just the first thing to pop into my head...)
That said, this is a bad example; you're really better off (both in terms of speed and readability) with the loop above.
import numpy as np
data = np.arange(1, 7)
groups = np.array([0,0,1,2,2,1])
unique_groups = np.unique(groups)
# Forgive the bad naming here...
# I can't think of more descriptive variable names at the moment...
x, y = np.meshgrid(groups, unique_groups)
data_stack = np.tile(data, (unique_groups.size, 1))
data_in_group = np.zeros_like(data_stack)
data_in_group[x==y] = data_stack[x==y]
sums = data_in_group.sum(axis=1)
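For what it's worth, the same fully vectorized idea can be written more compactly with broadcasting instead of meshgrid and tile (my variant, with the same 2D-temporary memory caveat as above):

# Build an (n_groups, n_data) boolean mask and reduce along axis 1.
sums = (data * (groups == unique_groups[:, None])).sum(axis=1)
# array([3, 9, 9])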
Comments:

- data[groups == group] will be quite fast.
- What does the data[groups == group] construct do? Comparing an array to a scalar yields some kind of slice or view? o_O
- groups == group yields a boolean array. You can index by arrays in numpy. This is a very common idiom in numpy (and Matlab). I find it quite readable (think of it as "where") and it's extremely useful.

If the groups are indexed by consecutive integers, you can abuse the numpy.histogram() function to get the result:
import numpy

data = numpy.arange(1, 7)
groups = numpy.array([0, 0, 1, 2, 2, 1])
sums = numpy.histogram(groups,
                       bins=numpy.arange(groups.min(), groups.max() + 2),
                       weights=data)[0]
# array([3, 9, 9])
This will avoid any Python loops.
I tried everyone's scripts and my considerations are:

| User | Comment |
|---|---|
| Joe | Will only work if you have few groups. |
| kevpie | Too slow because of loops (this is not the pythonic way). |
| Bi_Rico and Sven | Nice performance, but will only work for int32 (if the sum goes over 2^31 it will fail). |
| Alex | The fastest one; the best solution for sum. |
But if you want more flexibility and the possibility to compute other statistics per group, use SciPy:
import numpy as np
from scipy import ndimage
data = np.arange(10000000)
unique_groups = np.arange(1000)
groups = unique_groups.repeat(10000)
ndimage.sum(data, groups, unique_groups)
This is good because you have many statistics to compute per group (sum, mean, variance, ...).
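For example (my sketch; all of these are existing scipy.ndimage functions), the same call pattern works for other statistics:

import numpy as np
from scipy import ndimage

data = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 1, 2, 2, 1])
labels = np.unique(groups)

ndimage.sum(data, groups, labels)       # array([3., 9., 9.])
ndimage.mean(data, groups, labels)      # array([1.5, 4.5, 4.5])
ndimage.variance(data, groups, labels)  # per-group variances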
You're all wrong! The best way to do it is:
import numpy as np

a = [1, 2, 3, 4, 5, 6]
ix = [0, 0, 1, 2, 2, 1]
accum = np.zeros(np.max(ix) + 1)
np.add.at(accum, ix, a)
print(accum)
# [3. 9. 9.]
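A note of mine on why np.add.at is needed here: the seemingly equivalent fancy-indexed accum[ix] += a is buffered, so duplicate indices only keep the last write:

import numpy as np

a = [1, 2, 3, 4, 5, 6]
ix = [0, 0, 1, 2, 2, 1]

wrong = np.zeros(np.max(ix) + 1)
wrong[ix] += a             # buffered: only one write per index survives
# wrong -> array([2., 6., 5.])

correct = np.zeros(np.max(ix) + 1)
np.add.at(correct, ix, a)  # unbuffered: duplicates accumulate
# correct -> array([3., 9., 9.])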
I noticed the numpy tag, but in case you don't mind using pandas, this task becomes a one-liner:
import pandas as pd
import numpy as np
data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])
df = pd.DataFrame({'data': data, 'groups': groups})
So df then looks like this:
data groups
0 1 0
1 2 0
2 3 1
3 4 2
4 5 2
5 6 1
Now you can use the groupby() and sum() functions:
print(df.groupby(['groups'], sort=False).sum())
which gives you the desired output
data
groups
0 3
1 9
2 9
By default, the result would be sorted by the group key; I pass the sort=False flag, which might improve speed for huge dataframes.
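If you need a plain numpy array back rather than a DataFrame (my addition, using standard pandas API), you can extract it from the grouped result:

sums = df.groupby('groups')['data'].sum().to_numpy()
# array([3, 9, 9])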
Also, note for Alex's answer:
import numpy as np

data = [1, 2, 3, 4, 5, 6]
ids = [0, 0, 1, 2, 2, 1]
np.bincount(ids, weights=data)  # returns [3., 9., 9.] as a float64 array
If your ids are not consecutive, you might get stuck wondering why you keep getting a lot of zeros.
For instance:
data = [1, 2, 3, 4, 5, 6]
ids = [1, 1, 3, 5, 5, 3]
np.bincount(ids, weights=data)
will give you:
array([0., 3., 0., 9., 0., 9.])
which means it builds bins for every id from 0 to the max id in the list, and then returns the sum for each bin.
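If you only want the sums for the ids that actually occur, one way (my sketch) is to index the bincount result by the unique ids:

import numpy as np

data = [1, 2, 3, 4, 5, 6]
ids = np.array([1, 1, 3, 5, 5, 3])

present = np.unique(ids)  # array([1, 3, 5])
sums = np.bincount(ids, weights=data)[present]
# array([3., 9., 9.])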
I tried different methods to do this and found that using np.bincount is indeed the fastest; see Alex's answer.
import numpy as np
import time

size = 10000
ngroups = 10

groups = np.random.randint(low=0, high=ngroups, size=size)
values = np.random.rand(size)

# Test 1: index-based Python loop
beg = time.time()
result = np.zeros(ngroups)
for i in range(size):
    result[groups[i]] += values[i]
print('Test 1 took:', time.time() - beg)

# Test 2: zip-based Python loop
beg = time.time()
result = np.zeros(ngroups)
for g, v in zip(groups, values):
    result[g] += v
print('Test 2 took:', time.time() - beg)

# Test 3: np.where per group
beg = time.time()
result = np.zeros(ngroups)
for g in np.unique(groups):
    wh = np.where(groups == g)
    result[g] = np.sum(values[wh[0]])
print('Test 3 took:', time.time() - beg)

# Test 4: np.sum with a where mask per group
beg = time.time()
result = np.zeros(ngroups)
for g in np.unique(groups):
    wh = groups == g
    result[g] = np.sum(values, where=wh)
print('Test 4 took:', time.time() - beg)

# Test 5: list comprehension over np.where
beg = time.time()
result = np.array([np.sum(values[np.where(groups == g)[0]]) for g in np.unique(groups)])
print('Test 5 took:', time.time() - beg)

# Test 6: list comprehension over masked np.sum
beg = time.time()
result = np.array([np.sum(values, where=groups == g) for g in np.unique(groups)])
print('Test 6 took:', time.time() - beg)

# Test 7: np.bincount
beg = time.time()
result = np.bincount(groups, weights=values)
print('Test 7 took:', time.time() - beg)
Results:
Test 1 took: 0.005615234375
Test 2 took: 0.004812002182006836
Test 3 took: 0.0006084442138671875
Test 4 took: 0.0005099773406982422
Test 5 took: 0.000499725341796875
Test 6 took: 0.0004980564117431641
Test 7 took: 1.9073486328125e-05
Here's a method that works for summing objects of any dimension, grouped by values of any type (not only int):
import numpy as np

grouping = np.array([1.1, 10, 1.1, 15])
to_sum = np.array([
    [1, 0],
    [0, 1],
    [0.5, 0.3],
    [2, 5],
])

groups, element_group_ixs = np.unique(grouping, return_inverse=True)
accum = np.zeros((groups.shape[0], *to_sum.shape[1:]))
np.add.at(accum, element_group_ixs, to_sum)
results in:
groups = array([ 1.1, 10. , 15. ])
accum = array([
[1.5, 0.3],
[0. , 1. ],
[2. , 5. ]
])
(np.add.at idea taken from Peter's answer)
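To look up the accumulated row for a particular group value afterwards, one option (my sketch, relying on np.unique returning the groups sorted) is np.searchsorted:

row = accum[np.searchsorted(groups, 1.1)]
# array([1.5, 0.3])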
A pure Python implementation:

l = [1, 2, 3, 4, 5, 6]
g = [0, 0, 1, 2, 2, 1]

from operator import itemgetter
from collections import defaultdict

def group_sum(l, g):
    # Accumulate a running sum per group id.
    groups = defaultdict(int)
    for li, gi in zip(l, g):
        groups[gi] += li
    # Return the sums in group-number order.
    return list(map(itemgetter(1), sorted(groups.items())))

print(group_sum(l, g))
# [3, 9, 9]