Numpy array: group by one column, sum another

Question

I have an array that looks like this:

 array([[ 0,  1,  2],
        [ 1,  1,  6],
        [ 2,  2, 10],
        [ 3,  2, 14]])

I want to sum the values of the third column that have the same value in the second column, so the result looks like this:

 array([[ 0,  1,  8],
        [ 1,  2, 24]])

I started coding this but I'm stuck with this sum:

import numpy as np
import sys

inFile = sys.argv[1]

with open(inFile, 'r') as t:
    f = np.genfromtxt(t, delimiter=None, names =["1","2","3"])

f.sort(order=["1","2"])
if value == previous.value:
   sum(f["3"])

To clarify, it looks like your first column is a row number index, and doesn't have any regard for your data. You then want your second column to be the unique set of elements in that column, and the third column to be the sum of the existing third column for each of those set elements. Is your data already sorted by the second column, as it is in your example?
– Scott Mermelstein, Commented Mar 12, 2018 at 15:04
Yes, this is the data after being sorted. I just added a first column as indication that I have columns with "useless" information
– Anom, Commented Mar 12, 2018 at 15:09
Have you considered using pandas? It generates the index column and does grouping for you.
– Mad Physicist, Commented Mar 12, 2018 at 15:14
@Anom, if one of the below solutions helped, consider accepting it (green tick on left) so other users know.
– jpp, Commented Mar 22, 2018 at 13:18

Mad Physicist · Accepted Answer · 2018-03-12 16:32:47Z

7

If your data is sorted by the second column, you can use something centered around np.add.reduceat for a pure numpy solution. A combination of np.nonzero (or np.where) applied to np.diff will give you the locations where the second column switches values. You can use those indices to do the sum-reduction. The other columns are pretty formulaic, so you can concatenate them back in fairly easily:

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])
# Find the split indices
i = np.nonzero(np.diff(A[:, 1]))[0] + 1
i = np.insert(i, 0, 0)
# Compute the result columns
c0 = np.arange(i.size)
c1 = A[i, 1]
c2 = np.add.reduceat(A[:, 2], i)
# Concatenate the columns
result = np.c_[c0, c1, c2]

Notice the +1 in the indices. That is because you always want the location after the switch, not before, given how reduceat works. The insertion of zero as the first index could also be accomplished with np.r_, np.concatenate, etc.
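
The steps above assume the rows are already sorted by the second column. Here is a minimal sketch of the same reduceat idea wrapped in a function that sorts first, so equal keys become contiguous. The function name group_sum and the parameters key_col and val_col are hypothetical, chosen only for illustration, and the sketch assumes a non-empty 2D integer array.

```python
import numpy as np

def group_sum(a, key_col=1, val_col=2):
    """Sum val_col per distinct value of key_col (hypothetical helper)."""
    # A stable sort keeps the original order of rows within each key
    order = np.argsort(a[:, key_col], kind="stable")
    keys = a[order, key_col]
    vals = a[order, val_col]
    # Index of the first row of each run of equal keys (note the +1)
    starts = np.r_[0, np.nonzero(np.diff(keys))[0] + 1]
    sums = np.add.reduceat(vals, starts)
    # Columns: new row number, key, sum
    return np.c_[np.arange(starts.size), keys[starts], sums]

# Same data as in the question, but with the rows shuffled
A = np.array([[0, 2, 10],
              [1, 1,  2],
              [2, 2, 14],
              [3, 1,  6]])
result = group_sum(A)
print(result)
```

If the data is known to be sorted already, the argsort step can be skipped, which is what the original snippet does.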

That being said, I still think you are looking for the pandas version in @jpp's answer.

edited Mar 12, 2018 at 16:32

answered Mar 12, 2018 at 15:48

Mad Physicist


13 Comments

Anom Over a year ago

Hi, I think your solution is what I'm looking for but I have a problem with IndexError : too many indices. The array shape is (x,) (x is a number)

Mad Physicist Over a year ago

Are you passing in a 1D array somewhere without realizing it?

Anom Over a year ago

Yes, I checked that when I import the array from the external file and I assign names to the columns, it is changed from a 2D to a 1D

Ketil Tveiten Over a year ago

A comment regarding this solution vs pandas.groupby: this pure numpy solution is a lot faster.

Mad Physicist Over a year ago

@Ketil. Numpy generally tends to be a bit faster. But also less legible and harder to use for the more complex problems.
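
On the IndexError mentioned in the comments: passing names to np.genfromtxt produces a 1D structured array (one record per row), so 2D indexing such as A[:, 1] fails. A minimal sketch of two ways around this, assuming whitespace-delimited integer data. The in-memory text and the field names a, b, c are made up for illustration and stand in for the real file.

```python
import io
import numpy as np
from numpy.lib import recfunctions as rfn

# Stand-in for the contents of the input file (assumed layout)
text = "0 1 2\n1 1 6\n2 2 10\n3 2 14\n"

# With names=..., the result is a 1D structured array
structured = np.genfromtxt(io.StringIO(text), names=["a", "b", "c"])
print(structured.shape)   # one record per row, so the shape is 1D

# Option 1: load without names to get an ordinary 2D array
plain = np.genfromtxt(io.StringIO(text), dtype=int)
print(plain.shape)

# Option 2: convert an existing structured array to a plain 2D array
converted = rfn.structured_to_unstructured(structured).astype(int)
print(converted.shape)
```

Either plain or converted can then be used with the column indexing shown in the answer.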


jpp · Accepted Answer · 2018-03-12 15:57:58Z

5

You can use pandas to vectorize your algorithm:

import pandas as pd, numpy as np

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

df = pd.DataFrame(A)\
       .groupby(1, as_index=False)\
       .sum()\
       .reset_index()

res = df[['index', 1, 2]].values

Result

array([[ 0,  1,  8],
       [ 1,  2, 24]], dtype=int64)

edited Mar 12, 2018 at 15:57

answered Mar 12, 2018 at 15:14

jpp


2 Comments

Mad Physicist Over a year ago

Mine on the other hand, is pure numpy. Mainly posted to convince OP to go with pandas :)

Mad Physicist Over a year ago

Also, I'm pretty sure OP is not looking for 0: 'first'. The first column is 0, 1, not 0, 2 in OP's expected result.

Mercury · Accepted Answer · 2022-02-21 08:16:29Z

A very neat, pure numpy solution is possible using np.histogram:

A = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

c1 = np.unique(A[:, 1])
c0 = np.arange(c1.shape[0])
c2 = np.histogram(A[:, 1], weights=A[:, 2], bins=c1.shape[0])[0]

result = np.c_[c0, c1, c2]

>>> result
array([[ 0,  1,  8],
       [ 1,  2, 24]])

When a weights array is provided (of the same shape as the input array) to np.histogram, any arbitrary element a[i] in the input array a will contribute weights[i] in the count for its bin.

So for example, we are counting the second column, and instead of counting 2 instances of 2, we get 10 instances of 2 + 14 instances of 2 = a count of 24 in 2's bin.
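
One caveat: with bins given as a count, np.histogram uses equal-width bins between the minimum and maximum key, so the approach above relies on the keys being evenly spaced (as 1 and 2 are here). A sketch of a related weighted-count approach that does not depend on spacing or sort order, using np.unique with return_inverse together with np.bincount. The sample data below is made up for illustration.

```python
import numpy as np

# Unevenly spaced, unsorted keys in the second column (illustrative data)
A = np.array([[0, 1,  2],
              [1, 1,  6],
              [2, 5, 10],
              [3, 5, 14],
              [4, 2,  3]])

# keys: sorted distinct values; inverse: position of each row's key in keys
keys, inverse = np.unique(A[:, 1], return_inverse=True)
# bincount adds weights[i] to the bin inverse[i]; it returns floats,
# so cast back to the integer dtype of the input
sums = np.bincount(inverse.ravel(), weights=A[:, 2]).astype(A.dtype)

result = np.c_[np.arange(keys.size), keys, sums]
print(result)
```

The result has one row per distinct key, ordered by key value.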

IMCoins · Accepted Answer · 2018-03-12 15:45:25Z

1

Here is my solution using only numpy arrays...

import numpy as np
arr = np.array([[ 0,  1,  2], [ 1,  1,  6], [ 2,  2, 10], [ 3,  2, 14]])

lst = []
# compt is the output row number; index is the group value in the 2nd column
for compt, index in enumerate(range(1, max(arr[:, 1]) + 1)):
    lst.append([compt, index, np.sum(arr[arr[:, 1] == index][:, 2])])
lst = np.array(lst)
print(lst)
# lst, outputs...
# [[ 0  1  8]
#  [ 1  2 24]]

The tricky part is the np.sum(arr[arr[:, 1] == index][:, 2]), so let's break it down to multiple parts.

arr[arr[:, 1] == index] means...

You have an array arr, and we ask numpy for the rows whose second column (index 1) matches the current value of the for loop. The loop runs from 1 to the maximum value found in that column. Printing only this expression in the for loop results in...

# First iteration
[[0 1 2]
 [1 1 6]]
# Second iteration
[[ 2  2 10]
 [ 3  2 14]]

Adding [:, 2] to our expression, it means that we want the value of the 3rd column (meaning index 2), of our above lists. If I print arr[arr[:, 1] == index][:, 2], it would give me... [2, 6] at first iteration, and [10, 14] at the second.
I just need to sum these values using np.sum(), and to format my output list accordingly. :)
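
A small variation on the loop above, as a sketch: iterating over np.unique of the key column instead of a range from 1 to the maximum avoids producing rows for values that never occur. The sample data (keys 1 and 4) and the names mask and row_number are only illustrative.

```python
import numpy as np

arr = np.array([[0, 1,  2],
                [1, 1,  6],
                [2, 4, 10],
                [3, 4, 14]])

rows = []
for row_number, key in enumerate(np.unique(arr[:, 1])):
    mask = arr[:, 1] == key              # True for rows belonging to this key
    rows.append([row_number, key, arr[mask, 2].sum()])
result = np.array(rows)
print(result)
```

This is still a Python-level loop, so it scales with the number of distinct keys rather than being fully vectorized.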

answered Mar 12, 2018 at 15:45

IMCoins


4 Comments

Mad Physicist Over a year ago

You are looking for ufunc.reduceat. This can be done without loops. See my answer.

IMCoins Over a year ago

@MadPhysicist Well, this is why I am on this site : learning from others. I'm taking a look at it, thanks. :)

Mad Physicist Over a year ago

I remember the first time someone showed me reduceat. It's arcane but remarkably handy. In general, if you are using loops with numpy arrays, you probably need to reconsider your approach.

IMCoins Over a year ago

@MadPhysicist I already heard this, and I fully agree on it. Except, sometimes in cases like this, I don't have all day to look for a special function that would make my work easier. :'( -- This being said, I was looking for np.where() at first, but couldn't figure how to use it. Now I also know that I should have used it in association with np.diff(). :p

JahKnows · Accepted Answer · 2018-03-12 15:07:10Z

0

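Using a dictionary to store the values and then converting back to a list:

x = [[ 0,  1,  2],
     [ 1,  1,  6],
     [ 2,  2, 10],
     [ 3,  2, 14]]

y = {}
for val in x:
    if val[1] in y:
        y[val[1]][2] += val[2]
    else:
        # Store a copy so the rows of x are not modified in place
        y[val[1]] = list(val)
print([y[val] for val in y])
# [[0, 1, 8], [2, 2, 24]]  (first column keeps the value from each group's first row)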

answered Mar 12, 2018 at 15:07

JahKnows

2,7113 gold badges25 silver badges37 bronze badges


Hirabayashi Taro · Accepted Answer · 2018-03-12 15:34:20Z

0

You can also use a defaultdict and sum the values:

from collections import defaultdict

x = [[ 0,  1,  2],
     [ 1,  1,  6],
     [ 2,  2, 10],
     [ 3,  2, 14]]

res = defaultdict(int)
for val in x:
    res[val[1]] += val[2]
print([[i, val, res[val]] for i, val in enumerate(res)])

answered Mar 12, 2018 at 15:34

Hirabayashi Taro


4 Comments

ChatterOne Over a year ago

I think this is not guaranteed to keep the order of the original array (because dictionaries are not sorted)

Hirabayashi Taro Over a year ago

I was thinking the same, and I was actually surprised that with integer positive keys in python 3 I always got a sorted result.

hpaulj Over a year ago

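As of Python 3.7, dictionaries are guaranteed to preserve insertion order (CPython 3.6 already did so as an implementation detail). Note that this means insertion order, not sorted order.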

hpaulj Over a year ago

list(res.items()) can replace the last statement.
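
Following up on the ordering discussion, here is a minimal sketch that does not rely on the order in which keys were first seen: the keys are sorted explicitly before building the output. The shuffled input and the name totals are illustrative only.

```python
from collections import defaultdict

# Same data as in the question, but not sorted by the second column
x = [[0, 2, 10],
     [1, 1,  2],
     [2, 2, 14],
     [3, 1,  6]]

totals = defaultdict(int)
for _, key, value in x:
    totals[key] += value

# sorted(totals) iterates over the keys in ascending order
result = [[i, key, totals[key]] for i, key in enumerate(sorted(totals))]
print(result)
```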

zipa · Accepted Answer · 2018-03-12 16:02:37Z

0

To get exact output use pandas:

import pandas as pd
import numpy as np

a = np.array([[ 0,  1,  2],
              [ 1,  1,  6],
              [ 2,  2, 10],
              [ 3,  2, 14]])

df = pd.DataFrame(a)
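# Sum only column 2; as_matrix() was removed in pandas 1.0, so use to_numpy()
result = df.groupby(1)[[2]].sum().reset_index().reset_index().to_numpy()
print(result)
# [[ 0  1  8]
#  [ 1  2 24]]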

edited Mar 12, 2018 at 16:02

Mad Physicist


answered Mar 12, 2018 at 15:25

zipa


4 Comments

Mad Physicist Over a year ago

reset_index().reset_index()?

Mad Physicist Over a year ago

Also, you may want to do reset_index(inplace=True)

zipa Over a year ago

@MadPhysicist First one resets the groupby and second one adds new column that contains index values and matches the desired output.

Mad Physicist Over a year ago

The result is not what you claim it is... Take a look at your second column vs jpp's answer.
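
To make the double reset_index() discussed in these comments easier to follow, here is a sketch that prints the column labels after each step. It assumes only column 2 is summed (selected with [[2]]), so the old first column does not appear in the output. The names grouped, step1 and step2 are illustrative.

```python
import numpy as np
import pandas as pd

a = np.array([[0, 1,  2],
              [1, 1,  6],
              [2, 2, 10],
              [3, 2, 14]])

# Group on column 1 and sum only column 2; the group key becomes the index
grouped = pd.DataFrame(a).groupby(1)[[2]].sum()
# First reset_index: the group key moves back into a regular column
step1 = grouped.reset_index()
# Second reset_index: the fresh 0..n-1 index becomes a column named 'index'
step2 = step1.reset_index()

print(list(grouped.columns))
print(list(step1.columns))
print(list(step2.columns))
result = step2.to_numpy()
print(result)
```

Without the [[2]] selection, column 0 would be summed as well and show up as an extra column in the result.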
