
My DataFrame holds a NumPy array in every cell:

                                                col1  \
0  [[[0.878617777607, 0.712102459231, 0.652479557...   
1  [[[0.0815294305642, 0.793893471424, 0.24718091...   
2  [[[0.611498467162, 0.880551635123, 0.949764900...   

                                                col2  \
0  [[[0.390629506277, 0.0318899771374, 0.28308523...   
1  [[[0.578710371447, 0.385239304185, 0.330119601...   
2  [[[0.843661601339, 0.402833961663, 0.535083132...   

                                                col3  
0  [[[0.162446865578, 0.165619948624, 0.622459063...  
1  [[[0.859362904741, 0.415994003318, 0.706308170...  
2  [[[0.0559589731135, 0.307840549475, 0.80023067...  

How can I calculate the mean numpy array in this DataFrame? The result should be a numpy array that represents the mean of all numpy arrays inside my DataFrame.

Code

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [np.random.rand(4,4,4) for i in range(3)],
                   'col2': [np.random.rand(4,4,4) for i in range(3)],
                   'col3': [np.random.rand(4,4,4) for i in range(3)]})

Expected Output (For the code above): A numpy array that represents the mean of all numpy arrays

array([[[ 0.44091592,  0.81509111,  0.94968265,  0.60255149],
        [ 0.49263418,  0.69519008,  0.05023616,  0.67871942],
        [ 0.72771491,  0.9593636 ,  0.84673578,  0.43407915],
        [ 0.5884133 ,  0.63940507,  0.53364733,  0.51271129]],

       [[ 0.55612852,  0.58847166,  0.37781843,  0.7693527 ],
        [ 0.40610198,  0.05897461,  0.945253  ,  0.66332715],
        [ 0.74352406,  0.34969614,  0.50384616,  0.90582012],
        [ 0.38734233,  0.85533348,  0.94869219,  0.2863428 ]],

       [[ 0.81782769,  0.8856158 ,  0.68744406,  0.76579709],
        [ 0.05843924,  0.83090709,  0.99446694,  0.74937771],
        [ 0.11898717,  0.38715644,  0.50348724,  0.41903257],
        [ 0.21359555,  0.93407981,  0.20531033,  0.71017461]],

       [[ 0.88758803,  0.40433699,  0.02888434,  0.91075114],
        [ 0.84047283,  0.87119432,  0.14844659,  0.87643422],
        [ 0.06412383,  0.60458874,  0.47277274,  0.12969607],
        [ 0.31917517,  0.15647266,  0.89773897,  0.77962999]]])

I tried df.mean(), but it returns Series([], dtype: float64)

Also tried df.mean(axis=1).mean() and it returns NaN

UPDATE:

A much simpler example

df = pd.DataFrame({'col1': [np.array([[1,3],[4,2]]), np.array([[1,1],[3,2]])],
                   'col2': [np.array([[1,3],[3,3]]), np.array([[2,3],[3,1]])]})

DataFrame

Out[31]: 
               col1              col2
0  [[1, 3], [4, 2]]  [[1, 3], [3, 3]]
1  [[1, 1], [3, 2]]  [[2, 3], [3, 1]]

Expected output:

In[42]: (df.iloc[0,0]+df.iloc[0,1]+df.iloc[1,0]+df.iloc[1,1])/4.

Out[42]: 
array([[ 1.25,  2.5 ],
       [ 3.25,  2.  ]])
  • why do you want to organize your data this way? Commented Nov 25, 2017 at 1:07
  • @ahed87 I'm dealing with DataFrames that contain columns with images stored as NumPy arrays. Commented Nov 25, 2017 at 1:17
  • Just wondering: DataFrames add quite a lot of overhead if you already have NumPy arrays. Commented Nov 25, 2017 at 1:26

2 Answers


Sorry, I misunderstood your question earlier. Please try this:

df = pd.DataFrame({'col1': [np.array([[1.,3.],[4.,2.]]), np.array([[1.,1.],[3.,2.]])],
                   'col2': [np.array([[1.,3.],[3.,3.]]), np.array([[2.,3.],[3.,1.]])]})

print(df)
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() returns the same object array.
print(np.expand_dims(df.to_numpy(), axis=1).mean())
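
An alternative sketch, not the answerer's method: collect every cell into one regular ndarray with np.stack and average over the new leading axis. It assumes all cells share the same shape, and it uses the simplified DataFrame from the question; the names stacked and mean_array are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [np.array([[1, 3], [4, 2]]), np.array([[1, 1], [3, 2]])],
    'col2': [np.array([[1, 3], [3, 3]]), np.array([[2, 3], [3, 1]])],
})

# to_numpy() gives a 2-D object array whose elements are the cell arrays.
# Flatten it to a plain list of arrays, then stack them along a new
# leading axis, giving shape (n_cells, 2, 2). Assumes equal cell shapes.
cells = list(df.to_numpy().ravel())
stacked = np.stack(cells)

# Averaging over axis 0 yields the element-wise mean of all cells.
mean_array = stacked.mean(axis=0)
print(mean_array)
```

Because the result of np.stack is an ordinary numeric array, the usual ndarray methods apply to it, and the same code should work for the 4x4x4 example as long as every cell has the same shape.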

4 Comments

For some reason, this does not convert to a single NumPy array. Instead, it produces an array of arrays, so I can't use NumPy array methods on it.
Maybe I did not understand your question. Can you post a sample of the expected output?
@dgnumo I also put the solution for a much simpler problem. The mean of my 2x2 DataFrame is calculated by (df.iloc[0,0]+df.iloc[0,1]+df.iloc[1,0]+df.iloc[1,1])/4.
Updated the answer. Hope it helps.

I don't know why pandas is allergic to computing mean() on a DataFrame whose cells are arrays, but here is a workaround:

>>> df = pd.DataFrame({'col1': [np.array([[1,3],[4,2]]), np.array([[1,1],[3,2]])],
...                    'col2': [np.array([[1,3],[3,3]]), np.array([[2,3],[3,1]])]})
>>> np.mean([df[col].mean() for col in df.columns], axis=0)
array([[1.25, 2.5 ],
       [3.25, 2.  ]])

Doing df.mean(axis=0).mean(axis=1) throws an exception:

ValueError: If using all scalar values, you must pass an index
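
A likely explanation for the empty result of df.mean() in the question, offered as an assumption rather than a confirmed diagnosis: columns whose cells are arrays are stored with object dtype, which pandas does not treat as numeric. A minimal check on the simplified DataFrame (the name all_object is only illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': [np.array([[1, 3], [4, 2]]), np.array([[1, 1], [3, 2]])],
    'col2': [np.array([[1, 3], [3, 3]]), np.array([[2, 3], [3, 1]])],
})

# Each column holds Python objects (ndarrays), so its dtype is object.
print(df.dtypes)

# True when every column is object dtype, i.e. none is a numeric column.
all_object = bool((df.dtypes == object).all())
print(all_object)
```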
