
How do I calculate the mean for each of the workerids below? Here is my sample NumPy ndarray: column 0 is the workerid, column 1 is the latitude, and column 2 is the longitude.
I want to calculate the mean latitude and longitude for each workerid, keeping everything in NumPy (ndarray) without converting to Pandas.

import numpy

class WorkerPatientScores:

    '''
    I read from the Patient and Worker tables in SchedulingOptimization.
    '''
    def __init__(self, dist_weight=1):
        # sample rows: [workerid, latitude, longitude]
        self.a = [[25302, 32.133598100000000, -94.395845200000000],
                  [25302, 32.145095132560200, -94.358041585705600],
                  [25302, 32.160400000000000, -94.330700000000000],
                  [25305, 32.133598100000000, -94.395845200000000],
                  [25305, 32.115095132560200, -94.358041585705600],
                  [25305, 32.110400000000000, -94.330700000000000],
                  [25326, 32.123598100000000, -94.395845200000000],
                  [25326, 32.125095132560200, -94.358041585705600],
                  [25326, 32.120400000000000, -94.330700000000000],
                  [25341, 32.173598100000000, -94.395845200000000],
                  [25341, 32.175095132560200, -94.358041585705600],
                  [25341, 32.170400000000000, -94.330700000000000],
                  [25376, 32.153598100000000, -94.395845200000000],
                  [25376, 32.155095132560200, -94.358041585705600],
                  [25376, 32.150400000000000, -94.330700000000000]]

        ndarray = numpy.array(self.a)
        # round-trip through a Python list to pull out the (lat, lon) pairs;
        # a plain slice, ndarray[:, 1:], would do the same thing
        ndlist = ndarray.tolist()
        geo_tuple = [(p[1], p[2]) for p in ndlist]
        nd1 = numpy.array(geo_tuple)
        # this averages over ALL rows, not per workerid
        mean_tuple = numpy.mean(nd1, axis=0)
        print(mean_tuple)

The output of the above is:

[ 32.14303108 -94.36152893]


4 Answers


Given this array, we want to group by the first column and take the means of the other two columns:

import numpy as np

X = np.asarray([[25302, 32.133598100000000, -94.395845200000000],
                [25302, 32.145095132560200, -94.358041585705600],
                [25302, 32.160400000000000, -94.330700000000000],
                [25305, 32.133598100000000, -94.395845200000000],
                [25305, 32.115095132560200, -94.358041585705600],
                [25305, 32.110400000000000, -94.330700000000000],
                [25326, 32.123598100000000, -94.395845200000000],
                [25326, 32.125095132560200, -94.358041585705600],
                [25326, 32.120400000000000, -94.330700000000000],
                [25341, 32.173598100000000, -94.395845200000000],
                [25341, 32.175095132560200, -94.358041585705600],
                [25341, 32.170400000000000, -94.330700000000000],
                [25376, 32.153598100000000, -94.395845200000000],
                [25376, 32.155095132560200, -94.358041585705600],
                [25376, 32.150400000000000, -94.330700000000000]])

Using only NumPy and no Python loops:

# pull out the group labels and drop them from the data columns
groups = X[:, 0].copy()
X = np.delete(X, 0, axis=1)

# sort rows by group so each group's rows are contiguous (reduceat needs this)
_ndx = np.argsort(groups)
_id, _pos, g_count = np.unique(groups[_ndx],
                               return_index=True,
                               return_counts=True)

# sum each contiguous block of rows, then divide by the group size
g_sum = np.add.reduceat(X[_ndx], _pos, axis=0)
g_mean = g_sum / g_count[:, None]

Store the results in a dictionary:

>>> dict(zip(_id, g_mean))
{25302.0: array([ 32.14636441, -94.36152893]),
 25305.0: array([ 32.11969774, -94.36152893]),
 25326.0: array([ 32.12303108, -94.36152893]),
 25341.0: array([ 32.17303108, -94.36152893]),
 25376.0: array([ 32.15303108, -94.36152893])}
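
As a small follow-up (a sketch, not part of the original answer): if you would rather have a single ndarray than a dict, the ids can be stacked back onto the per-group means.

# one (n_groups, 3) ndarray of [workerid, mean_latitude, mean_longitude] rows
result = np.column_stack([_id, g_mean])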

1 Comment

👍 I had to realize for myself that the use of argsort is critical here (even though unique already sorts the labels), since the range indices given to reduceat must be contiguous. I think the delete is extraneous and may be replaced by X = X[_ndx, 1:], which keeps the rest more concise. Very clever! Excellent answer.
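
For concreteness, a sketch of that condensed variant (as suggested in the comment above; it assumes X is still the original 3-column array):

# sort once by the id column, then slice the ids off in the same step
_ndx = np.argsort(X[:, 0])
_id, _pos, g_count = np.unique(X[_ndx, 0], return_index=True, return_counts=True)
g_mean = np.add.reduceat(X[_ndx, 1:], _pos, axis=0) / g_count[:, None]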

You can use some creative array slicing and the where function to solve this problem.

# a is the (n, 3) ndarray of [workerid, latitude, longitude] rows
means = {}
for i in numpy.unique(a[:, 0]):
    tmp = a[numpy.where(a[:, 0] == i)]   # rows belonging to this workerid
    means[i] = (numpy.mean(tmp[:, 1]), numpy.mean(tmp[:, 2]))

The slice [:,0] is a handy way to extract a column (here the first) from a 2D array. To get the means, we find the unique IDs in the first column; then, for each of those, we extract the matching rows with where and average them. The end result is a dict of tuples, where the keys are the IDs and each value is a tuple holding the means of the other two columns. When I run it, it produces the following dict:

{25302.0: (32.1463644108534, -94.36152892856853),
 25305.0: (32.11969774418673, -94.36152892856853),
 25326.0: (32.12303107752007, -94.36152892856853),
 25341.0: (32.17303107752007, -94.36152892856853),
 25376.0: (32.15303107752007, -94.36152892856853)}
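
A minor variant (a sketch, not part of the original answer): since the ids live in a float64 ndarray, the dict keys come out as floats like 25302.0; casting with int() restores integer worker ids.

means = {}
for i in numpy.unique(a[:, 0]):
    tmp = a[a[:, 0] == i]
    means[int(i)] = (numpy.mean(tmp[:, 1]), numpy.mean(tmp[:, 2]))  # int keys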

1 Comment

This is much less efficient than @marco-cerliani's answer: that one removes the need for a Python loop (which is very slow compared to NumPy vectorization) and should be considered the correct answer.
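
If you want to check the claim yourself, here is a rough benchmarking sketch (the synthetic data, group count, and repeat count are all assumptions; actual timings depend on your data):

import timeit
import numpy as np

X = np.random.rand(100_000, 3)
X[:, 0] = np.random.randint(0, 1_000, size=len(X))  # synthetic worker ids

def loop_version():
    # per-id boolean masking inside a Python loop
    return {i: X[X[:, 0] == i, 1:].mean(axis=0) for i in np.unique(X[:, 0])}

def reduceat_version():
    # sort once, then reduce contiguous blocks
    ndx = np.argsort(X[:, 0])
    ids, pos, cnt = np.unique(X[ndx, 0], return_index=True, return_counts=True)
    return dict(zip(ids, np.add.reduceat(X[ndx, 1:], pos, axis=0) / cnt[:, None]))

print(timeit.timeit(loop_version, number=10))
print(timeit.timeit(reduceat_version, number=10))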

Adding my two cents: although @Marco's answer is much more performant than both the accepted answer and the suggestion I'm about to make, one can also use NumPy's histogram function to sum values according to some grouping.

import numpy as np

# map each id to a bin index 0..k-1 and count the rows per bin
labels, bin_labels, bin_counts = np.unique(X[:, 0], return_inverse=True,
                                           return_counts=True)
bins = np.arange(len(labels) + 1)
i = 1  # 1 for the second column (latitude), 2 for the next (longitude), etc.
# histogram with weights sums X[:, i] within each bin
s = np.histogram(bin_labels, weights=X[:, i], bins=bins)[0]
mean = s / bin_counts

Now mean holds the mean of column i for each label in labels (in the same order). Again, this is slower than np.add.reduceat, but I offer it as an alternative since it may serve other purposes...

1 Comment

This is similar in spirit to bincount which the doc describes thus: A possible use of bincount is to perform sums over variable-size chunks of an array, using the weights keyword.
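
A sketch of that bincount variant (reusing labels, bin_labels, and bin_counts from the answer above; one pass per value column):

# bincount with the weights keyword sums X[:, i] per label,
# exactly like the histogram call above
g_mean = np.column_stack([
    np.bincount(bin_labels, weights=X[:, i]) / bin_counts
    for i in (1, 2)
])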

Using the workerid column and a list comprehension, it would be:

import numpy as np

a = np.array(self.a)          # the sample data as an ndarray
ids = np.unique(a[:, 0])      # array of unique worker ids
pos_mean = [np.mean(a[a[:, 0] == i, 1:], axis=0) for i in ids]

But considering there always seem to be 3 consecutive measurements per worker, there should be a relatively easy way to vectorize it.
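
A sketch of that vectorization, assuming every worker really does contribute exactly 3 consecutive rows (true for the sample data, but an assumption in general):

ids = a[::3, 0]                                     # one id per block of 3 rows
pos_mean = a[:, 1:].reshape(-1, 3, 2).mean(axis=1)  # average within each block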

1 Comment

Awesome! I was able to add in the WorkerId by changing the above code to: pos_mean = [(i, np.mean(a[a[:, 0] == i, 1:], axis=0)) for i in ids]. But how can I remove "array" from the output?
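
One way to do that (a sketch, not part of the original answer): wrap each mean in tuple() so the printed result shows plain floats instead of array(...).

# tuple() converts each (2,) mean array into a plain (lat, lon) tuple
pos_mean = [(i, tuple(np.mean(a[a[:, 0] == i, 1:], axis=0))) for i in ids]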
