
I have a DataFrame where each entry in one column is a numpy array of numbers. For example:

import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({
    'id': [1, 1, 2, 2, 3, 3, 3, 4, 4],
    'data': [np.array([0.43, 0.32, 0.19]),
             np.array([0.41, 0.11, 0.21]),
             np.array([0.94, 0.35, 0.14]),
             np.array([0.78, 0.92, 0.45]),
             np.array([0.32, 0.63, 0.48]),
             np.array([0.17, 0.12, 0.15]),
             np.array([0.54, 0.12, 0.16]),
             np.array([0.48, 0.16, 0.19]),
             np.array([0.14, 0.47, 0.01])]
})

I want to group by the id column and aggregate by taking the element-wise average of the arrays. Splitting the arrays up into separate columns first is not feasible, since they are length 300 and I have 200,000+ rows. When I do df.groupby('id').mean(), I get the error "No numeric types to aggregate". I am able to get an element-wise mean of the arrays using df['data'].mean(), so I think there should be a way to do a grouped mean.

To clarify, I want the output to be one array for each value of id, where each element is the mean of the values at the corresponding position within that group. For the example above, the result should be:

pd.DataFrame.from_dict({
    'id': [1, 2, 3, 4],
    'data': [np.array([0.42, 0.215, 0.2]),
             np.array([0.86, 0.635, 0.29500000000000004]),
             np.array([0.3433333333333333, 0.29, 0.26333333333333336]),
             np.array([0.31, 0.315, 0.1])]
})

Could someone suggest how I might do this? Thanks!

1 Comment

What should the result look like?

4 Answers


Take the mean twice: once at the array level and once at the group level:

df['data'].map(np.mean).groupby(df['id']).mean().reset_index()

   id      data
0   1  0.278333
1   2  0.596667
2   3  0.298889
3   4  0.241667

Based on the comment (the desired output is one array per id), you can do:

pd.DataFrame(df['data'].tolist(), index=df['id']).mean(level=0).agg(np.array, 1)

id
1                                 [0.42, 0.215, 0.2]
2                 [0.86, 0.635, 0.29500000000000004]
3    [0.3433333333333333, 0.29, 0.26333333333333336]
4                                 [0.31, 0.315, 0.1]
dtype: object

Or:

df.groupby("id")['data'].apply(np.mean)
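For reference, on newer pandas versions where mean(level=...) is deprecated (see the comments below), the same result can be sketched with an explicit groupby on the index:

# Expand the arrays into a flat frame indexed by id, take per-group,
# per-position means, then pack each row back into a numpy array.
out = (pd.DataFrame(df['data'].tolist(), index=df['id'])
         .groupby(level=0)
         .mean()
         .agg(np.array, axis=1))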

3 Comments

I should have been more clear, I want the output to be an array (in the example it would be length four), where each element is the mean of the elements in that position.
@AndrejKesely Yes it does, but with np.mean in my version; just tested and edited :-) Plain mean gives me DataError: No numeric types to aggregate
"Using the level keyword in DataFrame and Series aggregations is deprecated." For future versions this should be pd.DataFrame(df['data'].tolist(), index=df['id']).groupby(level=0).mean().agg(np.array, 1)

First, splitting up the array is feasible, because your current storage keeps a complex object (a full numpy array) in every cell of the DataFrame. That takes a lot more space than simply storing a flat 2D array.

# Your current memory usage
df.memory_usage(deep=True).sum()
# 1352

# Create a new DataFrame (really just overwrite `df` but keep separate for illustration)
df1 = pd.concat([df['id'], pd.DataFrame(df['data'].tolist())], axis=1)
#   id     0     1     2
#0   1  0.43  0.32  0.19
#1   1  0.41  0.11  0.21
#2   2  0.94  0.35  0.14
#3   2  0.78  0.92  0.45
#4   3  0.32  0.63  0.48
#5   3  0.17  0.12  0.15
#6   3  0.54  0.12  0.16
#7   4  0.48  0.16  0.19
#8   4  0.14  0.47  0.01

Yes, this looks bigger, but in terms of memory it is actually smaller. The 3x factor here is a bit extreme; for larger DataFrames with long arrays the flat version will probably be more like 95% of the original memory, but it still has to be less.

df1.memory_usage(deep=True).sum()
#416
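To make the scaling claim concrete, here is a quick synthetic check; the sizes below are made-up stand-ins, scaled down from the question's 200,000 rows of length-300 arrays so it runs quickly:

rng = np.random.default_rng(0)
n_rows, arr_len = 10_000, 300  # hypothetical, smaller than the question's data

obj_df = pd.DataFrame({
    'id': rng.integers(0, 1_000, n_rows),
    'data': [rng.random(arr_len) for _ in range(n_rows)],
})
flat_df = pd.concat([obj_df['id'], pd.DataFrame(obj_df['data'].tolist())], axis=1)

print(obj_df.memory_usage(deep=True).sum())   # object column: one ndarray per row
print(flat_df.memory_usage(deep=True).sum())  # flat 2D float layout, less overhead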

And now your aggregation is a normal groupby + mean; the column labels give the position within the array:

df1.groupby('id').mean()
#           0      1         2
#id                           
#1   0.420000  0.215  0.200000
#2   0.860000  0.635  0.295000
#3   0.343333  0.290  0.263333
#4   0.310000  0.315  0.100000
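If you still need one array per id afterwards, the per-position columns can be packed back into arrays with the same agg trick used in the first answer; a minimal sketch:

# Each row of per-position group means becomes a single numpy array again.
df1.groupby('id').mean().agg(np.array, axis=1).reset_index(name='data')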

2 Comments

Awesome. I am guessing here, but I think keeping id as the index instead of using concat might save some space. We can then refer to the level when grouping.
@anky, not sure about the memory and the index, I haven't ever looked into it. Sadly mean(level=) is deprecated; it seems like they're on a warpath to simplify the pandas API and remove all of the redundant ways to do the same operation (r.i.p. lookup :( ), so you'll still need .groupby('id').mean() (good thing id can be in the index at least)

Group by id and take the mean, where the output for each group is the array of element-wise mean values:

df['data'].map(np.array).groupby(df['id']).mean().reset_index()

Output:

   id                                             data
0   1                               [0.42, 0.215, 0.2]
1   2               [0.86, 0.635, 0.29500000000000004]
2   3  [0.3433333333333333, 0.29, 0.26333333333333336]
3   4                               [0.31, 0.315, 0.1]

3 Comments

What is your pandas version? I am getting DataError: No numeric types to aggregate when I run this on 1.1.3.
pandas==1.3.2 numpy==1.21.2
This is interesting, as it is similar to my first answer but produces a different output. Can't say what pandas is changing with every version :/

You can always .apply the numpy mean.

df.groupby('id')['data'].apply(np.mean).apply(np.mean)

# returns:
id
1    0.278333
2    0.596667
3    0.298889
4    0.241667
Name: data, dtype: float64
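Note that the second apply collapses each group's array down to a single scalar (a mean of means). If you want the element-wise array per group instead, stopping after the first apply should be enough:

# np.mean over a group's Series of arrays averages position by position,
# yielding one array per id.
df.groupby('id')['data'].apply(np.mean)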
