0

I have a dataframe like this:

import pandas as pd

df = pd.DataFrame({
    'stuff_1_var_1': range(5),
    'stuff_1_var_2': range(2, 7),
    'stuff_2_var_1': range(3, 8),
    'stuff_2_var_2': range(5, 10)
})

   stuff_1_var_1  stuff_1_var_2  stuff_2_var_1  stuff_2_var_2
0              0              2              3              5
1              1              3              4              6

I would like to groupby based on the column headers and then add the mean and median of each group as new columns. So my expected output looks like this:

   stuff_1_var_mean  stuff_1_var_median  stuff_2_var_mean  stuff_2_var_median
0                 1                   1                 4                   4
1                 2                   2                 5                   5

Brief explanation: we have two groups stuff_1_var_ and stuff_2_var_ for which would calculate the mean and median per row. So, e.g. for stuff_1_var_ it would be:

# values from stuff_1_var_1 and stuff_1_var_2
(0 + 2) / 2 = 1 and 
( 1 + 3) / 2 = 2

The values are then added as a new column stuff_1_var_mean; analogue for meadian and stuff_2_var_.

I got until:

df = df.T

pattern = df.index.str.extract('(^stuff_\d_var_)', expand=False)

dfgb = df.groupby(pattern).agg(['mean', 'median']).T

          stuff_1_var_  stuff_2_var_
0 mean               1             4
  median             1             4
1 mean               2             5
  median             2             5

How can I do the final step(s)?

8
  • how would the output dataframe look like? the mean/median value will be duplicated across rows? Commented Feb 24, 2020 at 13:08
  • Your expected output is...confusing. Please clarify. Commented Feb 24, 2020 at 13:15
  • @YOLO: The desired output is included in the question; I added a bit more explanation; hope it is clearer now?! Commented Feb 24, 2020 at 13:44
  • @JoshFriedlander: I added a bit more explanation; hope it is clearer now?! Commented Feb 24, 2020 at 13:44
  • 1
    It doesn't mean you can't change it before proceeding... It can seem a lot of work at first but it's a really rewarding in the long run. Commented Feb 24, 2020 at 14:12

2 Answers 2

1

Your solution should be changed:

df = df.T

pattern = df.index.str.extract('(^stuff_\d_var_)', expand=False)

dfgb = df.groupby(pattern).agg(['mean', 'median']).T.unstack()
dfgb.columns = dfgb.columns.map(lambda x: f'{x[0]}{x[1]}')

print (dfgb)
   stuff_1_var_mean  stuff_1_var_median  stuff_2_var_mean  stuff_2_var_median
0                 1                   1                 4                   4
1                 2                   2                 5                   5
2                 3                   3                 6                   6
3                 4                   4                 7                   7
4                 5                   5                 8                   8

Unfortunately for axis=1 is not implemented agg, so possible solution is create mean and median separately and then concat:

dfgb = df.groupby(pattern, axis=1).agg(['mean','median'])

NotImplementedError: axis other than 0 is not supported

pattern = df.columns.str.extract('(^stuff_\d_var_)', expand=False)
g = df.groupby(pattern, axis=1)

dfgb = pd.concat([g.mean().add_suffix('mean'), 
                  g.median().add_suffix('median')], axis=1)
dfgb = dfgb.iloc[:, [0,2,1,3]]
print (dfgb)
   stuff_1_var_mean  stuff_1_var_median  stuff_2_var_mean  stuff_2_var_median
0                 1                   1                 4                   4
1                 2                   2                 5                   5
2                 3                   3                 6                   6
3                 4                   4                 7                   7
4                 5                   5                 8                   8
Sign up to request clarification or add additional context in comments.

4 Comments

That seems to do the trick, thanks again! Will wait with acceptance for a bit to check whether anything else shows up.
@Cleb - Is order of columns important?
@Cleb - Is possible first all means and then all median columns?
The order of columns is not important, no.
1

Here's a way you can do:

col = 'stuff_1_var_'
use_col = [x for x in df.columns if 'stuff_1' in x]

df[f'{col}mean'] = df[use_col].mean(1)
df[f'{col}median'] = df[use_col].median(1)

col2 = 'stuff_2_var_'
use_col = [x for x in df.columns if 'stuff_2' in x]

df[f'{col2}mean'] = df[use_col].mean(1)
df[f'{col2}median'] = df[use_col].median(1)

print(df.iloc[:,-4:]) # showing last four new columns

  stuff_1_var_mean  stuff_1_var_median  stuff_2_var_mean  stuff_2_var_median  
0               1.0                 1.0               4.0                 4.0  
1               2.0                 2.0               5.0                 5.0  
2               3.0                 3.0               6.0                 6.0  
3               4.0                 4.0               7.0                 7.0  
4               5.0                 5.0               8.0                 8.0  

Ofcourse, you can put it in a function to avoid repeating the same code.

1 Comment

Not as straightforward as @jezrael's answer, but still worth an upvote :)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.