Applying a custom groupby aggregate function to find average of Numpy Array

Question

I have a pandas DataFrame in which column B contains NumPy arrays of a fixed size.

|------|---------------|-------|
|  A   |       B       |   C   |
|------|---------------|-------|
|  0   |   [2,3,5,6]   |   X   |
|------|---------------|-------|
|  1   |   [1,2,3,4]   |   X   |
|------|---------------|-------|
|  2   |   [2,3,6,5]   |   Y   |
|------|---------------|-------|
|  3   |   [2,3,2,3]   |   Y   |
|------|---------------|-------|
|  4   |   [2,3,4,4]   |   Y   |
|------|---------------|-------|
|  5   |   [2,3,5,6]   |   Z   |
|------|---------------|-------|

I want to group the rows by column 'C' and calculate the element-wise average of the arrays in 'B', returned as a list, as in the table below. I want to do this efficiently.

|----------------|-------|
|        B       |   C   |
|----------------|-------|
|  [1.5,2.5,4,5] |   X   |
|----------------|-------|
|    [2,3,4,4]   |   Y   |
|----------------|-------|
|    [2,3,5,6]   |   Z   |
|----------------|-------|

I have considered splitting the NumPy arrays into individual columns, but that would be my last resort.

How can I write a custom aggregate function? Right now column B is treated as non-numeric, and aggregating it raises:

DataError: No numeric types to aggregate

jezrael · Accepted Answer · 2020-04-25 08:39:25Z

3

You can do this by converting each group's values to a 2D array and then applying np.mean along axis 0:

import numpy as np

# Stack each group's arrays into a 2D array, then average down the rows (axis=0)
f = lambda x: np.mean(np.array(x.tolist()), axis=0)
df2 = df.groupby('C')['B'].apply(f).reset_index()
print(df2)
   C                     B
0  X  [1.5, 2.5, 4.0, 5.0]
1  Y  [2.0, 3.0, 4.0, 4.0]
2  Z  [2.0, 3.0, 5.0, 6.0]
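
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the question's sample table and applies the same idea. The helper name mean_of_arrays is only illustrative, and np.stack is used here in place of np.array(x.tolist()); both give a 2D array when every row has the same length.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; each cell of 'B' holds a fixed-size NumPy array.
df = pd.DataFrame({
    "A": range(6),
    "B": [np.array(v) for v in ([2, 3, 5, 6], [1, 2, 3, 4], [2, 3, 6, 5],
                                [2, 3, 2, 3], [2, 3, 4, 4], [2, 3, 5, 6])],
    "C": list("XXYYYZ"),
})

def mean_of_arrays(group):
    # Hypothetical helper: stack the group's arrays into a 2D array
    # (one row per record), then average column-wise.
    return np.stack(group.to_list()).mean(axis=0)

result = df.groupby("C")["B"].apply(mean_of_arrays).reset_index()
print(result)
```

Each cell of result['B'] is a NumPy array; call .tolist() on it if plain Python lists are needed.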

The "last option" approach (splitting the arrays into separate columns) also works, but it is less efficient (thank you @Abhik Sarkar for the test):

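# Expand B into one column per element, then take the built-in groupby mean
df1 = pd.DataFrame(df.B.tolist()).groupby(df['C']).mean()
# Pack each row of means back into a single list-valued column
df2 = pd.DataFrame({'B': df1.values.tolist(), 'C': df1.index})
print(df2)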
                      B  C
0  [1.5, 2.5, 4.0, 5.0]  X
1  [2.0, 3.0, 4.0, 4.0]  Y
2  [2.0, 3.0, 5.0, 6.0]  Z

edited Apr 25, 2020 at 8:39

answered Apr 25, 2020 at 8:09

jezrael


2 Comments

Abhik Sarkar Over a year ago

Well in my case the first method is 8 times faster.

jezrael Over a year ago

@AbhikSarkar - Great, I've updated the answer.

moeelbedawi · Accepted Answer · 2020-05-13 07:19:43Z

2

Dummy data

import random

import numpy as np
import pandas as pd

size, list_size = 10, 5
data = [{'C': random.randint(95, 100),
         'B': [random.randint(0, 10) for i in range(list_size)]} for j in range(size)]
df = pd.DataFrame(data)

Custom Aggregation Using numpy

unique_C = df.C.unique()
data_calculated = []
axis = 0

for c in unique_C:
    # Flatten the lists of this group and reshape to (rows in group, list_size)
    arr = np.reshape(np.hstack(df[df.C == c]['B']), (-1, list_size))
    mean, std = arr.mean(axis=axis), arr.std(axis=axis)  # other aggregations can also be added
    data_calculated.append(dict(C=c, B_mean=mean, B_std=std))
new_df = pd.DataFrame(data_calculated)
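
As a variation, the same per-group statistics can be collected by iterating over a groupby object rather than filtering the frame once per unique value. This is only a sketch using the dummy-data layout above; the fixed seed is an assumption added so the example is reproducible.

```python
import random

import numpy as np
import pandas as pd

random.seed(0)  # fixed seed (an assumption) so the dummy data is reproducible
size, list_size = 10, 5
df = pd.DataFrame({
    "C": [random.randint(95, 100) for _ in range(size)],
    "B": [[random.randint(0, 10) for _ in range(list_size)] for _ in range(size)],
})

rows = []
# Iterating over the groupby object yields (key, sub-Series) pairs, one per value of C.
for c, group in df.groupby("C")["B"]:
    arr = np.array(group.tolist())  # shape: (rows in group, list_size)
    rows.append({"C": c, "B_mean": arr.mean(axis=0), "B_std": arr.std(axis=0)})
new_df = pd.DataFrame(rows)
print(new_df)
```

This splits the frame in a single pass, while the loop over unique values builds a boolean mask for every group.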

answered May 13, 2020 at 7:19

moeelbedawi
