Calculate mean numpy array in pandas DataFrame

Question

My DataFrame consists of numpy arrays as:

                                                col1  \
0  [[[0.878617777607, 0.712102459231, 0.652479557...   
1  [[[0.0815294305642, 0.793893471424, 0.24718091...   
2  [[[0.611498467162, 0.880551635123, 0.949764900...   

                                                col2  \
0  [[[0.390629506277, 0.0318899771374, 0.28308523...   
1  [[[0.578710371447, 0.385239304185, 0.330119601...   
2  [[[0.843661601339, 0.402833961663, 0.535083132...   

                                                col3  
0  [[[0.162446865578, 0.165619948624, 0.622459063...  
1  [[[0.859362904741, 0.415994003318, 0.706308170...  
2  [[[0.0559589731135, 0.307840549475, 0.80023067...

How can I calculate the mean numpy array in this DataFrame? The result should be a numpy array that represents the mean of all numpy arrays inside my DataFrame.

Code

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.random.rand(4,4,4) for i in range(3)],
                   'col2': [np.random.rand(4,4,4) for i in range(3)],
                   'col3': [np.random.rand(4,4,4) for i in range(3)]})

Expected Output (For the code above): A numpy array that represents the mean of all numpy arrays

array([[[ 0.44091592,  0.81509111,  0.94968265,  0.60255149],
        [ 0.49263418,  0.69519008,  0.05023616,  0.67871942],
        [ 0.72771491,  0.9593636 ,  0.84673578,  0.43407915],
        [ 0.5884133 ,  0.63940507,  0.53364733,  0.51271129]],

       [[ 0.55612852,  0.58847166,  0.37781843,  0.7693527 ],
        [ 0.40610198,  0.05897461,  0.945253  ,  0.66332715],
        [ 0.74352406,  0.34969614,  0.50384616,  0.90582012],
        [ 0.38734233,  0.85533348,  0.94869219,  0.2863428 ]],

       [[ 0.81782769,  0.8856158 ,  0.68744406,  0.76579709],
        [ 0.05843924,  0.83090709,  0.99446694,  0.74937771],
        [ 0.11898717,  0.38715644,  0.50348724,  0.41903257],
        [ 0.21359555,  0.93407981,  0.20531033,  0.71017461]],

       [[ 0.88758803,  0.40433699,  0.02888434,  0.91075114],
        [ 0.84047283,  0.87119432,  0.14844659,  0.87643422],
        [ 0.06412383,  0.60458874,  0.47277274,  0.12969607],
        [ 0.31917517,  0.15647266,  0.89773897,  0.77962999]]])

I tried df.mean(), but it returns Series([], dtype: float64).

I also tried df.mean(axis=1).mean(), which returns NaN.

UPDATE:

A much simpler example

df = pd.DataFrame({'col1': [np.array([[1,3],[4,2]]), np.array([[1,1],[3,2]])],
                   'col2': [np.array([[1,3],[3,3]]), np.array([[2,3],[3,1]])]})

DataFrame

Out[31]: 
               col1              col2
0  [[1, 3], [4, 2]]  [[1, 3], [3, 3]]
1  [[1, 1], [3, 2]]  [[2, 3], [3, 1]]

Expected output:

In[42]: (df.iloc[0,0]+df.iloc[0,1]+df.iloc[1,0]+df.iloc[1,1])/4.

Out[42]: 
array([[ 1.25,  2.5 ],
       [ 3.25,  2.  ]])

@ahed87 I'm dealing with DataFrames that contain columns with images in numpy arrays.
– fabda01, Commented Nov 25, 2017 at 1:17
Just wondered; DataFrames add quite a bit of overhead if you already have np arrays.
– ahed87, Commented Nov 25, 2017 at 1:26

dgumo · Accepted Answer · 2017-11-25 00:41:23Z

1

Sorry, I misunderstood your question earlier, please try this.

df = pd.DataFrame({'col1': [np.array([[1.,3.],[4.,2.]]), np.array([[1.,1.],[3.,2.]])],
                   'col2': [np.array([[1.,3.],[3.,3.]]), np.array([[2.,3.],[3.,1.]])]})

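print(df)
# df.as_matrix() was removed in pandas 1.0; df.to_numpy() is its replacement.
# The result is a 2x2 object array whose elements are the individual arrays.
print(np.expand_dims(df.to_numpy(), axis=1).mean())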
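
If a single numeric array is needed rather than an object array of arrays, one option is to stack the cells first. The following is a minimal sketch, not part of the original answer, and it assumes every cell holds an array of the same shape; the names cells, stacked, and mean_array are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.array([[1., 3.], [4., 2.]]), np.array([[1., 1.], [3., 2.]])],
                   'col2': [np.array([[1., 3.], [3., 3.]]), np.array([[2., 3.], [3., 1.]])]})

# Collect every cell's array into a plain list (column by column).
cells = [arr for col in df.columns for arr in df[col]]

# Stack the four 2x2 arrays into one float array of shape (4, 2, 2).
stacked = np.stack(cells)

# Averaging over the new leading axis gives the element-wise mean.
mean_array = stacked.mean(axis=0)
print(mean_array)
```

Because stacked is an ordinary float array, the usual numpy methods (reshape, std, and so on) should work on it and on mean_array directly.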

edited Nov 25, 2017 at 0:41

answered Nov 24, 2017 at 23:08

dgumo


4 Comments

fabda01 Over a year ago

For some reason, this does not convert to a single numpy array. Instead, it converts into an array of arrays, so I can't use numpy array methods on it.

dgumo Over a year ago

Maybe I did not understand your question. Can you post a sample of the expected output?

fabda01 Over a year ago

@dgumo I also added the expected output for a much simpler example. The mean of my 2x2 DataFrame is calculated by (df.iloc[0,0]+df.iloc[0,1]+df.iloc[1,0]+df.iloc[1,1])/4.

dgumo Over a year ago

Updated the answer. Hope it helps.

AlexLoss · Accepted Answer · 2020-02-20 12:29:12Z

0

I don't know why pandas is allergic to computing mean() on a DataFrame, but here is a workaround:

>>> df = pd.DataFrame({'col1': [np.array([[1,3],[4,2]]), np.array([[1,1],[3,2]])],
...                    'col2': [np.array([[1,3],[3,3]]), np.array([[2,3],[3,1]])]})
>>> np.mean([df[col].mean() for col in df.columns], axis=0)
array([[1.25, 2.5 ],
       [3.25, 2.  ]])

Doing df.mean(axis=0).mean(axis=1) throws an exception:

ValueError: If using all scalar values, you must pass an index
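
A likely reason for this behaviour is that columns holding arrays are stored with dtype object, so pandas does not treat them as numeric data. The sketch below is an illustration under that assumption, not something taken from the answer: it inspects the dtypes and then generalises the manual sum from the question by adding up every cell and dividing by the cell count. The names total and mean_array are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.array([[1, 3], [4, 2]]), np.array([[1, 1], [3, 2]])],
                   'col2': [np.array([[1, 3], [3, 3]]), np.array([[2, 3], [3, 1]])]})

# Columns that hold arrays are stored as dtype object, not as a numeric dtype.
print(df.dtypes)

# Add all cells element-wise with Python's built-in sum ...
total = sum(arr for col in df.columns for arr in df[col])

# ... and divide by the number of cells (rows * columns) to get the mean array.
mean_array = total / df.size
print(mean_array)
```

This only works when there are no missing cells and all arrays share one shape; otherwise the element-wise addition would fail or broadcast unexpectedly.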

answered Feb 20, 2020 at 12:29

AlexLoss
