4

I have a list of dataframes. Each dataframe was originally numerical data taken from which are all shaped identically with 21 rows and 5 columns. The first column is an index (index 0 to index 20). I want to compute the average (mean) values into a single dataframe. Then I want to export the dataframe to excel.

Here's a simplified version of my existing code:

#look to concatenate the dataframes together all at once
#dataFrameList is the given list of dataFrames
concatenatedDataframes = pd.concat(dataFrameList, axis = 1)

#grouping the dataframes by the index, which is the same across all of the dataframes
groupedByIndex = concatenatedDataframes.groupby(level = 0)

#take the mean 
meanDataFrame = groupedByIndex.mean()

# Create a Pandas Excel writer using openpyxl as the engine.
writer = pd.ExcelWriter(filepath, engine='openpyxl')
meanDataFrame.to_excel(writer)

However, when I open the excel file, I see what looks like EVERY dataframe is copied into the sheet and the average/mean values are not shown. A simplified example is shown below (cutting most of the rows and dataframes)

              Dataframe 1                   Dataframe 2                   Dataframe 3
Index  Col2   Col3   Col4   Col5     Col2   Col3   Col4   Col5     Col2   Col3   Col4   Col5
0      Data   Data   Data   Data     Data   Data   Data   Data     Data   Data   Data   Data
1      Data   Data   Data   Data     Data   Data   Data   Data     Data   Data   Data   Data
2      Data   Data   Data   Data     Data   Data   Data   Data     Data   Data   Data   Data
....

I'm looking for something more like:

           Averaged DF
Index  Col2                                   Col3                                   Col4
0      Mean Index0,Col2 across DFs    Mean Index0,Col3 across DFs    Mean Index0,Col4 across DFs
1      Mean Index1,Col2 across DFs    Mean Index1,Col3 across DFs    Mean Index1,Col4 across DFs
2      Mean Index2,Col2 across DFs    Mean Index2,Col3 across DFs    Mean Index3,Col4 across DFs
...

I have also already seen this answer: Get the mean across multiple Pandas DataFrames

If possible, I'm looking for a clean solution, not one which would simply involve looping through each dataFrame value by value. Any suggestions?

3 Answers 3

2

Perhaps I misunderstood what you asked

The solution is simple. You just need to concat along the correct axis

dummy data

df1 = pd.DataFrame(index=range(rows), columns=range(columns), data=[[10 + i * j for j in range(columns)] for i in range(rows) ])
df2 = df1 = pd.DataFrame(index=range(rows), columns=range(columns), data=[[i + j for j in range(columns)] for i in range(rows) ])

ps. this should be your job as OP

pd.concat

df_concat0 = pd.concat((df1, df2), axis=1)

puts all the dataframes next to eachother.

    0   1   0   1
0   10  10  0   1
1   10  11  1   2
2   10  12  2   3

If we want to do a groupby now, we first need to stack, groupby and stack again

df_concat0.stack().groupby(level=[0,1]).mean().unstack()

    0   1
0   5.0     5.5
1   5.5     6.5
2   6.0     7.5

If we do

df_concat = pd.concat((df1, df2))

This puts all the dataframes on top of eachother

    0   1
0   10  10
1   10  11
2   10  12
0   0   1
1   1   2
2   2   3

now we need to just groupby the index, like you did

df_concat.groupby(level=0).mean()

    0   1
0   5.0     5.5
1   5.5     6.5
2   6.0     7.5

and then use ExcelWriter as context manager

with pd.ExcelWriter(filepath, engine='openpyxl') as writer:
    result.to_excel(writer)

or just plain

result.to_excel(filepath, engine='openpyxl') 

if you can overwrite what is is filepath

Sign up to request clarification or add additional context in comments.

3 Comments

This seems to generate a Series, which isn't quite what I'm looking for
I adapted my answer, now you clarified what you needed
Perfect, exactly what I was looking for!
2

I suppose you need the mean of all rows against each column.

Concatenating a list of data frames with same index will add the columns from other data frames to the right of the first data frame. As below:

      col1  col2  col3  col1  col2  col3
    0     1     2     3     2     3     4
    1     2     3     4     3     4     5
    2     3     4     5     4     5     6
    3     4     5     6     5     6     7

Try appending the data frames and then group by and take the mean to get the desired result.

    ##creating data frames
    df1= pd.DataFrame({'col1':[1,2,3,4],
        'col2':[2,3,4,5],
        'col3':[3,4,5,6]})

    df2= pd.DataFrame({'col1':[2,3,4,5],
        'col2':[3,4,5,6],
        'col3':[4,5,6,7]})

    ## list of data frames
    dflist = [df1,df2]

    ## empty data frame to use for appending
    df=pd.DataFrame()

    #looping through each item in list and appending to empty data frame
    for i in dflist:
        df = df.append(i)

    # group by and calculating mean on index
    data_mean=df.groupby(level=0).mean()

Write to file as you are writing

Alternatively : Instead of appending using a for loop you can also mention the axis along which you want to concatenate the data frames, in your case you want to concatenate along the index(axis = 0) to put the data data frames on top top each other. As below:

       col1  col2  col3
    0     1     2     3
    1     2     3     4
    2     3     4     5
    3     4     5     6
    0     2     3     4
    1     3     4     5
    2     4     5     6
    3     5     6     7

    ##creating data frames
    df1= pd.DataFrame({'col1':[1,2,3,4],
                       'col2':[2,3,4,5],
                       'col3':[3,4,5,6]})

    df2= pd.DataFrame({'col1':[2,3,4,5],
                       'col2':[3,4,5,6],
                       'col3':[4,5,6,7]})

    ## list of data frames
    dflist = [df1,df2]

    #concat the dflist along axis 0 to put the data frames on top of each other
    df_concat=pd.concat(dflist,axis=0)

    # group by and calculating mean on index
    data_mean=df_concat.groupby(level=0).mean()

Write to file as you are writing

Comments

0
#First convert all your data frames to a numpy array, then use numpy  
#vector    based mean function. 
# import the two libraries 
import numpy as np
import pandas as pd
# convert to record arrays
ar_frame = [df.to_numpy(), for df in [df1, df2]]
# calculate the mean across the axis 0 for data frame if we want 
#for each row
data_mean = np.mean(ar_frame, axis =0)
# convert back to pandas if required. use any of the data frame 
# columns, 
#ofcourse this assumes that the columns names are the same across 
#data 
#frames
df_mean = pd.DataFrame(data_mean, columns = df.columns)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.