
I have a DataFrame where each entry in one column is a numpy array of numbers. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({
    'id': [1, 1, 2, 2, 3, 3, 3, 4, 4],
    'data': [np.array([0.43, 0.32, 0.19]),
             np.array([0.41, 0.11, 0.21]),
             np.array([0.94, 0.35, 0.14]),
             np.array([0.78, 0.92, 0.45]),
             np.array([0.32, 0.63, 0.48]),
             np.array([0.17, 0.12, 0.15]),
             np.array([0.54, 0.12, 0.16]),
             np.array([0.48, 0.16, 0.19]),
             np.array([0.14, 0.47, 0.01])]
})

I want to group by the id column and aggregate by taking the element-wise average of the arrays. Splitting the arrays up into separate columns first is not feasible, since they are length 300 and I have 200,000+ rows. When I do df.groupby('id').mean(), I get the error "No numeric types to aggregate". I am able to get an element-wise mean of the arrays using df['data'].mean(), so I think there should be a way to do a grouped mean.

To clarify, I want the output to be one array for each value of id, where each element is the mean of the values at the corresponding position within that group. For the example above, the result should be:

pd.DataFrame.from_dict({
    'id': [1, 2, 3, 4],
    'data': [np.array([0.42, 0.215, 0.2]),
             np.array([0.86, 0.635, 0.29500000000000004]),
             np.array([0.3433333333333333, 0.29, 0.26333333333333336]),
             np.array([0.31, 0.315, 0.1])]
})

Could someone suggest how I might do this? Thanks!

1 Comment

What should the result look like?

4 Answers


Take the mean twice: once at the array level and once at the group level:

df['data'].map(np.mean).groupby(df['id']).mean().reset_index()

   id      data
0   1  0.278333
1   2  0.596667
2   3  0.298889
3   4  0.241667

Based on the comment (the desired output is one array per id), you can do:

pd.DataFrame(df['data'].tolist(), index=df['id']).mean(level=0).agg(np.array, 1)

id
1                                 [0.42, 0.215, 0.2]
2                 [0.86, 0.635, 0.29500000000000004]
3    [0.3433333333333333, 0.29, 0.26333333333333336]
4                                 [0.31, 0.315, 0.1]
dtype: object

Or:

df.groupby("id")['data'].apply(np.mean)
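For reference, on newer pandas versions where mean(level=...) is deprecated (see the comments below), the same result can be sketched with an explicit groupby on the index:

# Expand the arrays into a flat frame indexed by id, take per-group,
# per-position means, then pack each row back into a numpy array.
out = (pd.DataFrame(df['data'].tolist(), index=df['id'])
         .groupby(level=0)
         .mean()
         .agg(np.array, axis=1))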

3 Comments

I should have been more clear, I want the output to be an array (in the example it would be length four), where each element is the mean of the elements in that position.
@AndrejKesely Yes it does, but with np.mean in my version; just tested and edited :-) Plain mean gives me DataError: No numeric types to aggregate
"Using the level keyword in DataFrame and Series aggregations is deprecated." For future versions this should be pd.DataFrame(df['data'].tolist(), index=df['id']).groupby(level=0).mean().agg(np.array, 1)

First, splitting up the array is feasible, because your current storage keeps a complex object (a full numpy array) in every cell of the DataFrame. That takes a lot more space than simply storing a flat 2D array.

# Your current memory usage
df.memory_usage(deep=True).sum()
# 1352

# Create a new DataFrame (really just overwrite `df` but keep separate for illustration)
df1 = pd.concat([df['id'], pd.DataFrame(df['data'].tolist())], axis=1)
#   id     0     1     2
#0   1  0.43  0.32  0.19
#1   1  0.41  0.11  0.21
#2   2  0.94  0.35  0.14
#3   2  0.78  0.92  0.45
#4   3  0.32  0.63  0.48
#5   3  0.17  0.12  0.15
#6   3  0.54  0.12  0.16
#7   4  0.48  0.16  0.19
#8   4  0.14  0.47  0.01

Yes, this looks bigger, but in terms of memory it is actually smaller. The 3x factor here is a bit extreme; for larger DataFrames with long arrays the flat version will probably be more like 95% of the original memory, but it still has to be less.

df1.memory_usage(deep=True).sum()
#416
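To make the scaling claim concrete, here is a quick synthetic check; the sizes below are made-up stand-ins, scaled down from the question's 200,000 rows of length-300 arrays so it runs quickly:

rng = np.random.default_rng(0)
n_rows, arr_len = 10_000, 300  # hypothetical, smaller than the question's data

obj_df = pd.DataFrame({
    'id': rng.integers(0, 1_000, n_rows),
    'data': [rng.random(arr_len) for _ in range(n_rows)],
})
flat_df = pd.concat([obj_df['id'], pd.DataFrame(obj_df['data'].tolist())], axis=1)

print(obj_df.memory_usage(deep=True).sum())   # object column: one ndarray per row
print(flat_df.memory_usage(deep=True).sum())  # flat 2D float layout, less overhead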

And now your aggregation is a normal groupby + mean; the column labels give the position within the array:

df1.groupby('id').mean()
#           0      1         2
#id                           
#1   0.420000  0.215  0.200000
#2   0.860000  0.635  0.295000
#3   0.343333  0.290  0.263333
#4   0.310000  0.315  0.100000
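If you still need one array per id afterwards, the per-position columns can be packed back into arrays with the same agg trick used in the first answer; a minimal sketch:

# Each row of per-position group means becomes a single numpy array again.
df1.groupby('id').mean().agg(np.array, axis=1).reset_index(name='data')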

2 Comments

Awesome. I am guessing here, but I think keeping id as the index instead of using concat might save some space. We can then refer to the level when grouping.
@anky, not sure about the memory and the index, I haven't ever looked into it. Sadly mean(level=) is deprecated; it seems like they're on a warpath to simplify the pandas API and remove all of the redundant ways to do the same operation (r.i.p. lookup :( ), so you'll still need .groupby('id').mean() (good thing id can be in the index at least)

Group by id and take the mean, where the output for each group is the array of element-wise mean values:

df['data'].map(np.array).groupby(df['id']).mean().reset_index()

Output:

   id                                             data
0   1                               [0.42, 0.215, 0.2]
1   2               [0.86, 0.635, 0.29500000000000004]
2   3  [0.3433333333333333, 0.29, 0.26333333333333336]
3   4                               [0.31, 0.315, 0.1]

3 Comments

What is your pandas version? I am getting DataError: No numeric types to aggregate when I run this on 1.1.3.
pandas==1.3.2 numpy==1.21.2
This is interesting, as it is similar to my first answer but produces a different output. Can't say what pandas is changing with every version :/

You can always .apply the numpy mean.

df.groupby('id')['data'].apply(np.mean).apply(np.mean)

# returns:
id
1    0.278333
2    0.596667
3    0.298889
4    0.241667
Name: data, dtype: float64
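Note that the second apply collapses each group's array down to a single scalar (a mean of means). If you want the element-wise array per group instead, stopping after the first apply should be enough:

# np.mean over a group's Series of arrays averages position by position,
# yielding one array per id.
df.groupby('id')['data'].apply(np.mean)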
