Calculating means of a NumPy array by grouping rows

Question

I have an m-by-n NumPy array A, where each row represents an observation of some data. My rows are also assigned to one of c classes, and the class for each row is stored in an m-by-1 NumPy array B. I now want to compute the mean observation data M for each class. How can I do this?

For example:

A = numpy.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]])
B = numpy.array([1, 0, 0, 1]) # the first row is class 1, the second row is class 0 ...
M = ...  # Do something

This should give me the output:

>>> M
numpy.array([[2, 3, 4], [2.5, 3.5, 4.5]])

Here, row i in M is the mean for class i.

Are you wedded to using numpy? Working with data and observations is more of a pandas problem; computing the means (as well as other statistics) is a one-liner.
– DSM, Commented Oct 29, 2014 at 15:55

eickenberg · Accepted Answer · 2014-10-29 16:02:17Z

3

As mentioned in a comment, depending on where you want to go with this, pandas may be more useful. However, this is still possible with plain numpy:

import numpy
A = numpy.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]])
B = numpy.array([1, 0, 0, 1])

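# Boolean m-by-c matrix: entry (i, k) is True when row i belongs to the k-th unique class
class_indicators = B[:, numpy.newaxis] == numpy.unique(B)
# The pseudo-inverse of the indicator matrix averages the rows of each class
mean_operator = numpy.linalg.pinv(class_indicators.astype(float))

# c-by-n matrix: row k is the mean of the rows of A in the k-th unique class
means = mean_operator.dot(A)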

This example works for any number of classes, but as you can see, it may be cumbersome.
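To see why the pseudo-inverse acts as an averaging operator, here is a minimal sketch. It assumes NumPy 1.9 or later for return_counts, and the names indicators, averaging, means_explicit and means_pinv are illustrative, not from the answer. Because each row belongs to exactly one class, the columns of the indicator matrix are orthogonal, so its pseudo-inverse should equal its transpose with each row divided by the class size.

```python
import numpy as np

A = np.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]], dtype=float)
B = np.array([1, 0, 0, 1])

classes, counts = np.unique(B, return_counts=True)

# One-hot indicator matrix: indicators[i, k] is 1.0 when row i is in classes[k]
indicators = (B[:, np.newaxis] == classes).astype(float)

# Transpose scaled by 1 / class size: row k sums the rows of class k and
# divides by how many there are
averaging = indicators.T / counts[:, np.newaxis]

means_explicit = averaging.dot(A)
means_pinv = np.linalg.pinv(indicators).dot(A)

# Both routes should agree up to floating-point error
print(np.allclose(means_explicit, means_pinv))
```

The explicit form avoids the SVD that pinv performs internally, at the cost of still building a dense m-by-c matrix.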


Daniel · Accepted Answer · 2014-10-29 16:36:50Z

2

Another way to do this uses numpy's new at functionality (numpy.add.at):

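import numpy

A = numpy.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]])
B = numpy.array([1, 0, 0, 1])

# uinds maps each row to the position of its class in u, so it is a valid
# row index into M even when the labels are not 0..c-1
u, uinds = numpy.unique(B, return_inverse=True)
M = numpy.zeros((u.shape[0], A.shape[-1]))
numpy.add.at(M, uinds, A)  # unbuffered add: repeated indices accumulate
M /= numpy.bincount(uinds)[:, None]  # divide each class sum by its row count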

>>> M
array([[ 2. ,  3. ,  4. ],
       [ 2.5,  3.5,  4.5]])

As mentioned, pandas would make this easier:

import pandas as pd

>>> pd.DataFrame(A).groupby(B).mean()
     0    1    2
0  2.0  3.0  4.0
1  2.5  3.5  4.5

edited Oct 29, 2014 at 16:36

answered Oct 29, 2014 at 16:11


4 Comments

eickenberg Over a year ago

I like add.at! However M[B] += A shouldn't work with an index array, because the indexing creates a copy, and the addition is swallowed by it.

Daniel Over a year ago

Hmm, you are right; I think one of the betas combined the functionality. A thorough reading of the documentation indicates that this is intended. Thank you for pointing this out.

eickenberg Over a year ago

In any case, +1 for add.at, I really learnt something there!
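The difference discussed in this thread can be checked with a small sketch (the names buffered and unbuffered are illustrative). With repeated indices, M[B] += A does not accumulate: each class row ends up holding a single row of A rather than the sum, whereas numpy.add.at adds every contribution.

```python
import numpy as np

A = np.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]], dtype=float)
B = np.array([1, 0, 0, 1])

# Fancy-indexed in-place add: evaluated as buffered[B] = buffered[B] + A,
# so repeated indices overwrite one another instead of accumulating
buffered = np.zeros((2, 3))
buffered[B] += A

# Unbuffered add: every occurrence of an index contributes to the sum
unbuffered = np.zeros((2, 3))
np.add.at(unbuffered, B, A)

print(buffered)    # one row of A per class, not the class sum
print(unbuffered)  # per-class sums of the rows of A
```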

Jaime Over a year ago

NumPy 1.9 adds a return_counts kwarg to np.unique, so you can skip the call to np.bincount and instead do u, ucnts = numpy.unique(B, return_counts=True); ... ; M /= ucnts[:, None], which is also noticeably faster.
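Put together, the suggestion in this comment might look like the sketch below. It assumes NumPy 1.9 or later, and it requests the inverse indices in the same np.unique call so they can index M; the names classes, inverse and counts are illustrative.

```python
import numpy as np

A = np.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]])
B = np.array([1, 0, 0, 1])

# A single call returns the sorted labels, each row's position among them,
# and the number of rows per label
classes, inverse, counts = np.unique(B, return_inverse=True, return_counts=True)

M = np.zeros((classes.shape[0], A.shape[-1]))
np.add.at(M, inverse, A)   # per-class sums
M /= counts[:, None]       # per-class means

print(M)
```

Indexing with inverse rather than B keeps the sketch valid when the class labels are not the integers 0 to c-1.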

Eelco Hoogendoorn · Accepted Answer · 2016-04-03 18:22:29Z

0

This is a typical grouping problem, which can be solved in a single line using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.group_by(B).mean(A)
