
How do I calculate the mean for each of the workerids below? Here is my sample NumPy ndarray: column 0 is the workerid, column 1 is the latitude, and column 2 is the longitude.
I want to calculate the mean latitude and longitude for each workerid, keeping everything in NumPy (ndarray) without converting to Pandas.

import numpy

class WorkerPatientScores:

    '''
    I read from the Patient and Worker tables in SchedulingOptimization.
    '''
    def __init__(self, dist_weight=1):
        # sample rows: [workerid, latitude, longitude]
        self.a = [[25302, 32.133598100000000, -94.395845200000000],
                  [25302, 32.145095132560200, -94.358041585705600],
                  [25302, 32.160400000000000, -94.330700000000000],
                  [25305, 32.133598100000000, -94.395845200000000],
                  [25305, 32.115095132560200, -94.358041585705600],
                  [25305, 32.110400000000000, -94.330700000000000],
                  [25326, 32.123598100000000, -94.395845200000000],
                  [25326, 32.125095132560200, -94.358041585705600],
                  [25326, 32.120400000000000, -94.330700000000000],
                  [25341, 32.173598100000000, -94.395845200000000],
                  [25341, 32.175095132560200, -94.358041585705600],
                  [25341, 32.170400000000000, -94.330700000000000],
                  [25376, 32.153598100000000, -94.395845200000000],
                  [25376, 32.155095132560200, -94.358041585705600],
                  [25376, 32.150400000000000, -94.330700000000000]]

        ndarray = numpy.array(self.a)
        # round-trip through a Python list to pull out the (lat, lon) pairs;
        # a plain slice, ndarray[:, 1:], would do the same thing
        ndlist = ndarray.tolist()
        geo_tuple = [(p[1], p[2]) for p in ndlist]
        nd1 = numpy.array(geo_tuple)
        # this averages over ALL rows, not per workerid
        mean_tuple = numpy.mean(nd1, axis=0)
        print(mean_tuple)

The output of the above is:

[ 32.14303108 -94.36152893]


4 Answers


Given this array, we want to group by the first column and take the means of the other two columns:

import numpy as np

X = np.asarray([[25302, 32.133598100000000, -94.395845200000000],
                [25302, 32.145095132560200, -94.358041585705600],
                [25302, 32.160400000000000, -94.330700000000000],
                [25305, 32.133598100000000, -94.395845200000000],
                [25305, 32.115095132560200, -94.358041585705600],
                [25305, 32.110400000000000, -94.330700000000000],
                [25326, 32.123598100000000, -94.395845200000000],
                [25326, 32.125095132560200, -94.358041585705600],
                [25326, 32.120400000000000, -94.330700000000000],
                [25341, 32.173598100000000, -94.395845200000000],
                [25341, 32.175095132560200, -94.358041585705600],
                [25341, 32.170400000000000, -94.330700000000000],
                [25376, 32.153598100000000, -94.395845200000000],
                [25376, 32.155095132560200, -94.358041585705600],
                [25376, 32.150400000000000, -94.330700000000000]])

Using only NumPy and no Python loops:

# pull out the group labels and drop them from the data columns
groups = X[:, 0].copy()
X = np.delete(X, 0, axis=1)

# sort rows by group so each group's rows are contiguous (reduceat needs this)
_ndx = np.argsort(groups)
_id, _pos, g_count = np.unique(groups[_ndx],
                               return_index=True,
                               return_counts=True)

# sum each contiguous block of rows, then divide by the group size
g_sum = np.add.reduceat(X[_ndx], _pos, axis=0)
g_mean = g_sum / g_count[:, None]

Store the results in a dictionary:

>>> dict(zip(_id, g_mean))
{25302.0: array([ 32.14636441, -94.36152893]),
 25305.0: array([ 32.11969774, -94.36152893]),
 25326.0: array([ 32.12303108, -94.36152893]),
 25341.0: array([ 32.17303108, -94.36152893]),
 25376.0: array([ 32.15303108, -94.36152893])}
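
As a small follow-up (a sketch, not part of the original answer): if you would rather have a single ndarray than a dict, the ids can be stacked back onto the per-group means.

# one (n_groups, 3) ndarray of [workerid, mean_latitude, mean_longitude] rows
result = np.column_stack([_id, g_mean])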

1 Comment

👍 I had to realize for myself that the use of argsort is critical here (even though unique already sorts the labels), since the range indices given to reduceat must be contiguous. I think the delete is extraneous and may be replaced by X = X[_ndx, 1:], which keeps the rest more concise. Very clever! Excellent answer.
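
For concreteness, a sketch of that condensed variant (as suggested in the comment above; it assumes X is still the original 3-column array):

# sort once by the id column, then slice the ids off in the same step
_ndx = np.argsort(X[:, 0])
_id, _pos, g_count = np.unique(X[_ndx, 0], return_index=True, return_counts=True)
g_mean = np.add.reduceat(X[_ndx, 1:], _pos, axis=0) / g_count[:, None]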

You can use some creative array slicing and the where function to solve this problem.

# a is the (n, 3) ndarray of [workerid, latitude, longitude] rows
means = {}
for i in numpy.unique(a[:, 0]):
    tmp = a[numpy.where(a[:, 0] == i)]   # rows belonging to this workerid
    means[i] = (numpy.mean(tmp[:, 1]), numpy.mean(tmp[:, 2]))

The slice [:,0] is a handy way to extract a column (here the first) from a 2D array. To get the means, we find the unique IDs in the first column; then, for each of those, we extract the matching rows with where and average them. The end result is a dict of tuples, where the keys are the IDs and each value is a tuple holding the means of the other two columns. When I run it, it produces the following dict:

{25302.0: (32.1463644108534, -94.36152892856853),
 25305.0: (32.11969774418673, -94.36152892856853),
 25326.0: (32.12303107752007, -94.36152892856853),
 25341.0: (32.17303107752007, -94.36152892856853),
 25376.0: (32.15303107752007, -94.36152892856853)}
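
A minor variant (a sketch, not part of the original answer): since the ids live in a float64 ndarray, the dict keys come out as floats like 25302.0; casting with int() restores integer worker ids.

means = {}
for i in numpy.unique(a[:, 0]):
    tmp = a[a[:, 0] == i]
    means[int(i)] = (numpy.mean(tmp[:, 1]), numpy.mean(tmp[:, 2]))  # int keys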

1 Comment

This is much less efficient than @marco-cerliani's answer: that one removes the need for a Python loop (which is very slow compared to NumPy vectorization) and should be considered the correct answer.
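
If you want to check the claim yourself, here is a rough benchmarking sketch (the synthetic data, group count, and repeat count are all assumptions; actual timings depend on your data):

import timeit
import numpy as np

X = np.random.rand(100_000, 3)
X[:, 0] = np.random.randint(0, 1_000, size=len(X))  # synthetic worker ids

def loop_version():
    # per-id boolean masking inside a Python loop
    return {i: X[X[:, 0] == i, 1:].mean(axis=0) for i in np.unique(X[:, 0])}

def reduceat_version():
    # sort once, then reduce contiguous blocks
    ndx = np.argsort(X[:, 0])
    ids, pos, cnt = np.unique(X[ndx, 0], return_index=True, return_counts=True)
    return dict(zip(ids, np.add.reduceat(X[ndx, 1:], pos, axis=0) / cnt[:, None]))

print(timeit.timeit(loop_version, number=10))
print(timeit.timeit(reduceat_version, number=10))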

Adding my two cents: although @Marco's answer is much more performant than both the accepted answer and the suggestion I'm about to make, one can also use NumPy's histogram function to sum values according to some grouping.

import numpy as np

# map each id to a bin index 0..k-1 and count the rows per bin
labels, bin_labels, bin_counts = np.unique(X[:, 0], return_inverse=True,
                                           return_counts=True)
bins = np.arange(len(labels) + 1)
i = 1  # 1 for the second column (latitude), 2 for the next (longitude), etc.
# histogram with weights sums X[:, i] within each bin
s = np.histogram(bin_labels, weights=X[:, i], bins=bins)[0]
mean = s / bin_counts

Now mean holds the mean of column i for each label in labels (in the same order). Again, this is slower than np.add.reduceat, but I offer it as an alternative since it may serve other purposes...

1 Comment

This is similar in spirit to bincount which the doc describes thus: A possible use of bincount is to perform sums over variable-size chunks of an array, using the weights keyword.
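
A sketch of that bincount variant (reusing labels, bin_labels, and bin_counts from the answer above; one pass per value column):

# bincount with the weights keyword sums X[:, i] per label,
# exactly like the histogram call above
g_mean = np.column_stack([
    np.bincount(bin_labels, weights=X[:, i]) / bin_counts
    for i in (1, 2)
])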

Using the workerid column and a list comprehension, it would be:

import numpy as np

a = np.array(self.a)          # the sample data as an ndarray
ids = np.unique(a[:, 0])      # array of unique worker ids
pos_mean = [np.mean(a[a[:, 0] == i, 1:], axis=0) for i in ids]

But considering there always seem to be 3 consecutive measurements per worker, there should be a relatively easy way to vectorize it.
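
A sketch of that vectorization, assuming every worker really does contribute exactly 3 consecutive rows (true for the sample data, but an assumption in general):

ids = a[::3, 0]                                     # one id per block of 3 rows
pos_mean = a[:, 1:].reshape(-1, 3, 2).mean(axis=1)  # average within each block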

1 Comment

Awesome! I was able to add in the WorkerId by changing the above code to: pos_mean = [(i, np.mean(a[a[:, 0] == i, 1:], axis=0)) for i in ids]. But how can I remove "array" from the output?
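
One way to do that (a sketch, not part of the original answer): wrap each mean in tuple() so the printed result shows plain floats instead of array(...).

# tuple() converts each (2,) mean array into a plain (lat, lon) tuple
pos_mean = [(i, tuple(np.mean(a[a[:, 0] == i, 1:], axis=0))) for i in ids]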
