Pandas Dataframe containing Numpy ndarray and mean

Question

I have a Pandas dataframe containing Numpy ndarrays:

import numpy as np, pandas as pd
x = pd.DataFrame(columns=['a', 'b'])
x.loc['t1'] = [np.random.rand(2000, 500), np.random.rand(2000)]
x.loc['t2'] = [np.random.rand(2000, 500), np.random.rand(2000)]
x.loc['t3'] = [np.random.rand(2000, 500), np.random.rand(2000)]
print(x)
                                                    a                                                  b
# t1  [[0.8613174378493778, 0.5959214775442211, 0.62...  [0.4603835101674928, 0.3552761341266353, 0.949...
# t2  [[0.15792328922236398, 0.4274550633264813, 0.5...  [0.20059737978647396, 0.9445869962005252, 0.38...
# t3  [[0.43047697993868284, 0.7127140849172484, 0.4...  [0.6868215656323862, 0.14146376237438463, 0.51...

This works: it computes the element-wise mean of the column b arrays across the 3 rows (a mean along the vertical axis):

x.loc[:, 'b'].mean()
# [0.44926749 0.4804423  0.61566989 ... 0.4717142  0.70605732 0.55848075]

But how do I compute the mean along the other axis, i.e. one value per row? This fails:

x.loc[:, 'b'].mean(axis=1)   # or axis="b"

Expected result:

           b
t1         0.46
t2         0.31
t3         0.79

You cannot do this directly. You would need to loop, which defeats the purpose of using pandas/NumPy. You should rather use an ndarray here for efficiency.
– mozway, Commented Jun 24, 2022 at 9:39
@mozway Oh really, is that impossible? That is a shame, because yes, it would defeat the point of using pandas and NumPy together... ndarrays are great, but not so much when we want labeled indexing. This means I should probably use xarray, as seen in stackoverflow.com/questions/72733385/…. BTW, your ideas are welcome for this question!
– Basj, Commented Jun 24, 2022 at 9:43
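
A minimal sketch of the ndarray suggestion from the comment above, covering column b only. It assumes all the b arrays have the same length, so they fit into one 2-D array. The names `labels`, `b` and `b_df` are made up for illustration, and the seeded generator is only there to make the run reproducible. Wrapping the 2-D array in a DataFrame keeps the row labels and makes both axes available to `mean`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Assumed layout: one 2-D array with one row per label, instead of an
# object column that holds three separate 1-D arrays.
labels = ["t1", "t2", "t3"]
b = rng.random((3, 2000))

# A regular numeric DataFrame: rows are labeled, columns are positions 0..1999.
b_df = pd.DataFrame(b, index=labels)

per_label_mean = b_df.mean(axis=1)     # one value per label (t1, t2, t3)
per_position_mean = b_df.mean(axis=0)  # one value per position (2000 values)

print(per_label_mean)
```

This does not extend directly to column a, whose entries are themselves 2-D. That case would need a 3-D array or a labeled container such as xarray, as mentioned in the comment.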

Basj · Accepted Answer · 2022-06-24 10:02:11Z

You could always apply a mean function on the column, creating a new column in x, like this:

import numpy as np, pandas as pd
x = pd.DataFrame(columns=['a', 'b'])
x.loc['t1'] = [np.random.rand(2000, 500), np.random.rand(2000)]
x.loc['t2'] = [np.random.rand(2000, 500), np.random.rand(2000)]
x.loc['t3'] = [np.random.rand(2000, 500), np.random.rand(2000)]

x["b_mean"] = x["b"].apply(lambda y: np.mean(y))
# or just:
x["b_mean"] = x["b"].apply(np.mean)

Which results in:

t1    0.506371
t2    0.501433
t3    0.493867
Name: b_mean, dtype: float64
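
As a possible alternative to `apply`, and only under the assumption that every array in the column has the same length, the arrays can be stacked into one 2-D array and averaged in a single NumPy call. This is a sketch, not part of the original answer. The Series `b` below stands in for `x["b"]`, and the name `stacked` is illustrative:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible

# Stand-in for x["b"]: an object Series holding one 1-D array per label.
b = pd.Series({key: rng.random(2000) for key in ("t1", "t2", "t3")}, name="b")

# Stack the per-row arrays into shape (3, 2000); this requires equal lengths.
stacked = np.stack(b.to_list())

# Mean along axis 1 gives one value per row; reattach the original labels.
b_mean = pd.Series(stacked.mean(axis=1), index=b.index, name="b_mean")

print(b_mean)
```

If the arrays differ in length, `np.stack` raises an error, and the element-wise `apply(np.mean)` shown above remains the simpler choice.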

answered Jun 24, 2022 at 9:52 by mabergerx
edited Jun 24, 2022 at 10:02 by Basj
