Assume the following for each Python code:

import pandas as pd
import numpy as np

In Pandas, if I have a DataFrame with two columns, one of which holds a NumPy array in each row, I can sum that column to get a single array (the element-wise sum of all the rows).

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'numbers' : [np.array([1, 2, 3, 4]),np.array([2, 4, 2, 4]),np.array([2, 3, 4, 5]),np.array([1, 3, 5, 7])]} )
df['numbers'].sum()
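For reference, summing the `numbers` column adds the arrays position by position. A minimal sketch of the same arithmetic in plain NumPy, assuming every array in the column has the same length (the name `total` is only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'numbers': [np.array([1, 2, 3, 4]), np.array([2, 4, 2, 4]),
                np.array([2, 3, 4, 5]), np.array([1, 3, 5, 7])],
})

# Stack the four arrays into a 4x4 matrix, then add down the rows.
total = np.sum(np.vstack(df['numbers'].tolist()), axis=0)
print(total)  # [ 6 12 14 20]
```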

I can even group by the first column and then sum over the second column to get sums for each group:

grpA = df.groupby('A')
grpA.sum()

However, if I have multiple other columns besides the array column, say 2 other columns, then I get a ValueError: Function does not reduce when trying to group by the first two columns and sum over the array column:

df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],'B': ['la', 'la', 'al', 'al'],'numbers' : [np.array([1, 2, 3, 4]),np.array([2, 4, 2, 4]),np.array([2, 3, 4, 5]),np.array([1, 3, 5, 7])]} )
grpAB = df2.groupby(['A','B'])
grpAB.sum()

In SQL, the following would work if I could sum over arrays:

select A, B, sum(numbers)
    from df2
    group by A, B

Is there a way to successfully group by multiple columns and sum over the last array column in Pandas?

3 Answers

1

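You can use a lambda expression. For each group, iat[0] takes the first element of the group's Series (here, the group's single array), and .sum() then adds up the values inside that array. The result is one scalar total per group, not an element-wise array, and only the first row of each group is read.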

>>> df2.groupby(['A', 'B']).numbers.apply(lambda x: x.iat[0].sum())

A    B 
bar  al    16
     la    12
foo  al    14
     la    10
Name: numbers, dtype: int64
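To make the distinction concrete, here is a small sketch on a standalone Series that stands in for one group with two rows. The Series `s` and the variable names are hypothetical and not part of the question's data:

```python
import numpy as np
import pandas as pd

# Hypothetical group containing two arrays.
s = pd.Series([np.array([1, 2, 3, 4]), np.array([2, 3, 4, 5])])

# What the lambda computes: the total of the FIRST array only.
first_total = s.iat[0].sum()

# Element-wise sum across all arrays in the group.
elementwise = np.sum(np.vstack(s.tolist()), axis=0)

print(first_total)   # 10
print(elementwise)   # [3 5 7 9]
```

In the question's df2 every (A, B) pair occurs exactly once, which is why the lambda's output matches the per-row totals there.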

0

A possible solution is

df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],'B': ['la', 'la', 'al', 'al'],'numbers' : [np.array([1, 2, 3, 4]),np.array([2, 4, 2, 4]),np.array([2, 3, 4, 5]),np.array([1, 3, 5, 7])]} )

grouped = df2.groupby(['A','B'])

# Collect one summed array and one group key per group.
array = []
index = []

# Loop through the grouped data and sum the arrays within each group.
for i, j in grouped:
    array.append({'numbers': j.numbers.sum()})
    index.append(i)

# Put the summed arrays back into a DataFrame, indexed by group key.
print(pd.DataFrame(array, index=index))
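A variant of the same loop, sketched under two assumptions: all arrays have equal length, and a group may contain more than one row. The frame `df3` is a hypothetical modification of df2 in which the pair ('foo', 'la') occurs twice, and the sum is done explicitly with NumPy rather than through the Series' own sum:

```python
import numpy as np
import pandas as pd

# Hypothetical data: ('foo', 'la') appears in two rows.
df3 = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['la', 'la', 'la', 'al'],
    'numbers': [np.array([1, 2, 3, 4]), np.array([2, 4, 2, 4]),
                np.array([2, 3, 4, 5]), np.array([1, 3, 5, 7])],
})

rows, keys = [], []
for key, group in df3.groupby(['A', 'B']):
    # Stack the group's arrays and add them position by position.
    rows.append(np.sum(np.vstack(group['numbers'].tolist()), axis=0))
    keys.append(key)

# Rebuild a DataFrame with a named (A, B) MultiIndex.
out = pd.DataFrame(
    {'numbers': rows},
    index=pd.MultiIndex.from_tuples(keys, names=['A', 'B']),
)
print(out)
```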

2 Comments

Hi Tom, it doesn't look like this works. It outputs just one array and is equivalent to df2['numbers'].sum(). But you have given me an idea with apply. Let me see if I can figure something out.
Hi, sorry, I misunderstood the problem. I have edited the answer, and it should now be close to what you are looking for.

0

df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],'B': ['la', 'la', 'al', 'al'],'numbers' : [np.array([1, 2, 3, 4]),np.array([2, 4, 2, 4]),np.array([2, 3, 4, 5]),np.array([1, 3, 5, 7])]} )


Out[42]:
     A   B    numbers
0   foo la  [1, 2, 3, 4]
1   bar la  [2, 4, 2, 4]
2   foo al  [2, 3, 4, 5]
3   bar al  [1, 3, 5, 7]

grpAB = df2.groupby(['A','B'])
res = grpAB.apply(lambda x : x.numbers.sum())


Out[43]:
A    B 
bar  al    [1, 3, 5, 7]
     la    [2, 4, 2, 4]
foo  al    [2, 3, 4, 5]
     la    [1, 2, 3, 4]
dtype: object

pd.DataFrame(res , columns = ['numbers'])


Out[44]:
             numbers
A   B
bar al  [1, 3, 5, 7]
    la  [2, 4, 2, 4]
foo al  [2, 3, 4, 5]
    la  [1, 2, 3, 4]

# if you want to reset the index
pd.DataFrame(res , columns = ['numbers']).reset_index()


Out[45]:
     A  B   numbers
0   bar al  [1, 3, 5, 7]
1   bar la  [2, 4, 2, 4]
2   foo al  [2, 3, 4, 5]
3   foo la  [1, 2, 3, 4]
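Another possible route, not taken from the answers above, is to avoid summing an object column altogether: spread each array over ordinary numeric columns, group those, and pack the rows back into arrays afterwards. This is a sketch that assumes all arrays have the same length. The frame `df3` is hypothetical (df2 with ('foo', 'la') occurring twice, so that a real element-wise sum is visible), and `wide`, `summed`, and `result` are illustrative names:

```python
import numpy as np
import pandas as pd

# Hypothetical data: ('foo', 'la') appears in two rows.
df3 = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['la', 'la', 'la', 'al'],
    'numbers': [np.array([1, 2, 3, 4]), np.array([2, 4, 2, 4]),
                np.array([2, 3, 4, 5]), np.array([1, 3, 5, 7])],
})

# One numeric column per array position (columns 0, 1, 2, 3).
wide = pd.DataFrame(np.vstack(df3['numbers'].tolist()), index=df3.index)

# Group the numeric columns by the original key columns and sum them.
summed = wide.groupby([df3['A'], df3['B']]).sum()

# Optionally pack each summed row back into a single array per group.
result = pd.Series(list(summed.to_numpy()), index=summed.index, name='numbers')
print(result)
```

Keeping the data in the `wide` numeric form is usually the more convenient choice if further arithmetic is planned, since ordinary groupby aggregations then apply directly.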
