
I have an m-by-n NumPy array A, where each row represents an observation of some data. My rows are also assigned to one of c classes, and the class for each row is stored in an m-by-1 NumPy array B. I now want to compute the mean observation data M for each class. How can I do this?

For example:

A = numpy.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]])
B = numpy.array([1, 0, 0, 1]) # the first row is class 1, the second row is class 0 ...
M = # Do something

This should give me the output:

>>> M
numpy.array([[2, 3, 4], [2.5, 3.5, 4.5]])

Here, row i in M is the mean for class i.

Are you wedded to using numpy? Working with data and observations is more a pandas problem; computing the means (as well as other statistics) is a one-liner. Commented Oct 29, 2014 at 15:55

3 Answers

3

As mentioned in a comment, pandas may be more useful depending on where you want to go with this. It is still possible with plain NumPy, though:

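import numpy
A = numpy.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]])
B = numpy.array([1, 0, 0, 1])

# Boolean m-by-c matrix: entry (i, j) is True when row i belongs to the j-th unique class
class_indicators = B[:, numpy.newaxis] == numpy.unique(B)
# The pseudo-inverse of the indicator matrix averages the rows of each class
mean_operator = numpy.linalg.pinv(class_indicators.astype(float))

means = mean_operator.dot(A)  # c-by-n array, one mean row per class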

This works for any number of classes, but as you can see, it is somewhat cumbersome.
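To see why the pseudo-inverse produces means: every row belongs to exactly one class, so the columns of the indicator matrix are orthogonal, and its pseudo-inverse reduces to the transposed indicator matrix with each row scaled by 1/count. The sketch below checks that on the example data. The names explicit_means and pinv_means are mine, and return_counts assumes NumPy 1.9 or later.

```python
import numpy

A = numpy.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]])
B = numpy.array([1, 0, 0, 1])

# Sorted unique class labels and the number of rows in each class
labels, counts = numpy.unique(B, return_counts=True)

# m-by-c indicator matrix: 1.0 where row i belongs to class j
indicators = (B[:, numpy.newaxis] == labels).astype(float)

# Explicit form: per-class sums divided by per-class counts
explicit_means = indicators.T.dot(A) / counts[:, numpy.newaxis]

# Pseudo-inverse form, as in the answer above
pinv_means = numpy.linalg.pinv(indicators).dot(A)

# The two agree up to floating-point rounding
print(numpy.allclose(explicit_means, pinv_means))
```

The explicit form avoids computing a pseudo-inverse, which may matter when m is large.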


2

Another way to do this uses the new at method of NumPy ufuncs (numpy.add.at):

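import numpy

A = numpy.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]])
B = numpy.array([1, 0, 0, 1])

# uinds maps every row to the index of its class in the sorted unique labels u
u, uinds = numpy.unique(B, return_inverse=True)
M = numpy.zeros((u.shape[0], A.shape[-1]))
# Unbuffered in-place add: rows that share a class are all accumulated
# (indexing with uinds rather than B, so labels need not be 0..c-1)
numpy.add.at(M, uinds, A)
M /= numpy.bincount(uinds)[:, None]  # divide each class sum by its row count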

>>> M
array([[ 2. ,  3. ,  4. ],
       [ 2.5,  3.5,  4.5]])

As mentioned, pandas makes this easier:

import pandas as pd

>>> pd.DataFrame(A).groupby(B).mean()
     0    1    2
0  2.0  3.0  4.0
1  2.5  3.5  4.5

4 Comments

I like add.at! However M[B] += A shouldn't work with an index array, because the indexing creates a copy, and the addition is swallowed by it.
Hmm, you are right; I think one of the betas combined the functionality. A thorough reading of the documentation indicates that this is intended. Thank you for pointing this out.
In any case, +1 for add.at, I really learnt something there!
Numpy 1.9 adds a return_counts kwarg to np.unique, so you can skip the call to np.bincount and do u, ucnts = numpy.unique(B, return_counts=True); ... ; M /= ucnts[:, None], which is also noticeably faster.
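
The two points raised in these comments can be sketched together, assuming NumPy 1.9 or later for return_counts. The names buffered and sums are illustrative only. The first part shows the buffered `+=` losing additions when an index repeats; the second part is the return_counts variant of the answer.

```python
import numpy

A = numpy.array([[1, 2, 3], [1, 2, 3], [3, 4, 5], [4, 5, 6]])
B = numpy.array([1, 0, 0, 1])

# Buffered fancy-index addition: each class index appears twice in B,
# but the additions for a repeated index are not accumulated.
buffered = numpy.zeros((2, 3))
buffered[B] += A

# Unbuffered addition: every row is accumulated into its class.
sums = numpy.zeros((2, 3))
numpy.add.at(sums, B, A)

print(numpy.array_equal(buffered, sums))  # the two results differ

# return_counts gives the class sizes directly, so bincount is not needed.
u, uinds, ucnts = numpy.unique(B, return_inverse=True, return_counts=True)
M = numpy.zeros((u.shape[0], A.shape[-1]))
numpy.add.at(M, uinds, A)
M /= ucnts[:, None]
```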
0

This is a typical grouping problem, which can be solved in a single line using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
npi.group_by(B).mean(A)
