Assume the following imports for each Python snippet:
import pandas as pd
import numpy as np
In Pandas, if I have a DataFrame with 2 columns, one of which holds a NumPy array of numbers in each row, I can sum over that column to get a single array (the element-wise sum of all rows).
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'numbers': [np.array([1, 2, 3, 4]), np.array([2, 4, 2, 4]), np.array([2, 3, 4, 5]), np.array([1, 3, 5, 7])]})
df['numbers'].sum()
I can even group by the first column and then sum over the second column to get sums for each group:
grpA = df.groupby('A')
grpA.sum()
However, if the DataFrame has more than one column besides the array column, say 2 other columns, then grouping by those two columns and summing over the array column raises ValueError: Function does not reduce:
df2 = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'], 'B': ['la', 'la', 'al', 'al'], 'numbers': [np.array([1, 2, 3, 4]), np.array([2, 4, 2, 4]), np.array([2, 3, 4, 5]), np.array([1, 3, 5, 7])]})
grpAB = df2.groupby(['A','B'])
grpAB.sum()
In SQL, the following would work if I could sum over arrays:
select A, B, sum(numbers)
from df2
group by A, B
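To make the desired result concrete, here is a rough sketch that computes it by hand, looping over the groups instead of calling sum() on the groupby object. It is only meant to illustrate the output I am after, not to suggest this is the idiomatic approach; the name expected is just for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['la', 'la', 'al', 'al'],
    'numbers': [np.array([1, 2, 3, 4]), np.array([2, 4, 2, 4]),
                np.array([2, 3, 4, 5]), np.array([1, 3, 5, 7])],
})

# Walk the (A, B) groups manually and add up the arrays in each one.
expected = {}
for (a, b), group in df2.groupby(['A', 'B']):
    # Stack the group's arrays into a 2-D array, then sum down the rows
    # to get one element-wise total per group.
    expected[(a, b)] = np.sum(np.stack(group['numbers'].to_list()), axis=0)

for key, total in expected.items():
    print(key, total)
```

In this particular df2 every (A, B) pair occurs exactly once, so each group's sum is simply its own array, but the same loop would add the arrays element-wise for groups with several rows.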
Is there a way to successfully group by multiple columns and sum over the last array column in Pandas?