I have a pretty standard function that seems to create very odd responses; I thought I had figured out what was going on but now I'm not so sure.

Essentially, I'd like to use the rolling function to create a simple rolling average of the two previous values. When I do this directly, it seems to pull values from elsewhere in the frame for the first rows, and when I do it in a loop I have no idea where the values are coming from.

Sample data:

player      game_id     game_order  TOI_comp   G_comp
A.J..GREER  2016020227  37          16.566667  0
            2016020251  36          11.733333  0
            2016020268  35          12.700000  0
            2016020278  34          15.433333  0
            2016020296  33          11.850000  0

player_avgs_base.sort_values(by=['player','game_order'],ascending=False, inplace=True)

avgtoi = player_avgs_base["TOI_comp"].rolling(2).mean().shift()
avgtoi

player         game_id     game_order
ZENON.KONOPKA  2013021047  2                   NaN
A.J..GREER     2016020268  35                  NaN
               2016020278  34             9.308333
               2016020296  33            14.066667
               2017020134  32            13.641667
               2017020149  31            10.108333
               2017020165  30             7.175000
               2017020194  29             6.100000

I would have expected more like

player         game_id     game_order
A.J..GREER     2016020251  36                  NaN
               2016020268  35                  NaN
               2016020278  34                12.22
               2016020296  33            14.066667
               2017020134  32            13.641667
               2017020149  31            10.108333
  • It may be important to note that those first columns are the index, set with player_avgs_base.set_index(["player",'game_id',"game_order"], inplace=True) Commented Sep 9, 2019 at 1:37

1 Answer

I think this is a sorting problem. Please check whether this fixes it:

player_avgs_base.sort_values(["player","game_order"], ascending=False, inplace=True) 

If you like, you can set your index after executing the sort.

Another point is that, with your code, rolling doesn't respect the grouping. I guess you want to calculate the rolling mean per player and not mix in values from other players, right? If so, you can use the following code:

df2 = df.sort_values(["player", "game_id", "game_order"])
# transform keeps the original row index, so the result aligns with df2 on assignment
df2["TOI_comp_avg_lt"] = df2.groupby("player")["TOI_comp"].transform(lambda ser: ser.rolling(2).mean().shift())

This outputs:

          player     game_id  game_order   TOI_comp  G_comp  TOI_comp_avg_lt
0     A.J..GREER  2016020227          37  16.566667       0              NaN
2     A.J..GREER  2016020251          36  11.733333       0              NaN
4     A.J..GREER  2016020268          35  12.700000       0        14.150000
6     A.J..GREER  2016020278          34  15.433333       0        12.216666
7     A.J..GREER  2016020296          33  11.850000       0        14.066666
1  ZENON.KONOPKA  2013021047          34  12.666666       0              NaN
5  ZENON.KONOPKA  2013021047          35  14.722222       0              NaN
3  ZENON.KONOPKA  2013021047          37  13.111111       0        13.694444

For the following test data:

import pandas as pd
import io

raw= """A.J..GREER     2016020227  37  16.566667   0
ZENON.KONOPKA  2013021047  34  12.666666   0
A.J..GREER     2016020251  36  11.733333   0
ZENON.KONOPKA  2013021047  37  13.111111   0
A.J..GREER     2016020268  35  12.700000   0
ZENON.KONOPKA  2013021047  35  14.722222   0
A.J..GREER     2016020278  34  15.433333   0
A.J..GREER     2016020296  33  11.850000   0"""

df = pd.read_csv(io.StringIO(raw), sep=r'\s+', names=['player', 'game_id', 'game_order', 'TOI_comp', 'G_comp'])
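As a sanity check on what "rolling mean of the two previous games, per player" means, here is a minimal sketch using the same test data. It compares the rolling(2).mean().shift() approach with a plain average of the two previous rows within each player. The names manual and via_rolling are only illustrative.

```python
import io
import pandas as pd

raw = """A.J..GREER     2016020227  37  16.566667   0
ZENON.KONOPKA  2013021047  34  12.666666   0
A.J..GREER     2016020251  36  11.733333   0
ZENON.KONOPKA  2013021047  37  13.111111   0
A.J..GREER     2016020268  35  12.700000   0
ZENON.KONOPKA  2013021047  35  14.722222   0
A.J..GREER     2016020278  34  15.433333   0
A.J..GREER     2016020296  33  11.850000   0"""

df = pd.read_csv(io.StringIO(raw), sep=r"\s+",
                 names=["player", "game_id", "game_order", "TOI_comp", "G_comp"])
df2 = df.sort_values(["player", "game_id", "game_order"])

grouped = df2.groupby("player")["TOI_comp"]

# Average of the two previous rows, computed by hand within each player.
manual = (grouped.shift(1) + grouped.shift(2)) / 2

# The same quantity via a per-player rolling window, shifted down by one row.
via_rolling = grouped.transform(lambda ser: ser.rolling(2).mean().shift())

print(pd.DataFrame({"manual": manual, "via_rolling": via_rolling}))
```

Both columns should agree up to floating-point rounding, and the first two rows of every player should be NaN because no two previous games exist yet.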

By the way, your set_index is no replacement for the sort: the index has no effect on the output. For example, if you take df as defined above and execute:

df_indexed= df.set_index(["player",'game_id',"game_order"]) 
df_indexed_result= df_indexed.copy()
df_indexed_result['TOI_comp_shifted']= df_indexed["TOI_comp"].shift()
df_indexed_result['TOI_comp_rolling_mean']= df_indexed["TOI_comp"].rolling(2).mean().shift()

You get:

                                      TOI_comp  G_comp  TOI_comp_shifted  TOI_comp_rolling_mean
player        game_id    game_order                                                            
A.J..GREER    2016020227 37          16.566667       0               NaN                    NaN
ZENON.KONOPKA 2013021047 34          12.666666       0         16.566667                    NaN
A.J..GREER    2016020251 36          11.733333       0         12.666666              14.616667
ZENON.KONOPKA 2013021047 37          13.111111       0         11.733333              12.200000
A.J..GREER    2016020268 35          12.700000       0         13.111111              12.422222
ZENON.KONOPKA 2013021047 35          14.722222       0         12.700000              12.905555
A.J..GREER    2016020278 34          15.433333       0         14.722222              13.711111
              2016020296 33          11.850000       0         15.433333              15.077777

If you look at the TOI_comp_shifted column, you can see that it is simply filled with the value from the previous row, regardless of which player that row belongs to (the same is true for the rolling mean). So the index has no effect on this operation.
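A small sketch of the difference, using four rows from the test data and a reduced two-level index (an assumption made only to keep the example short): a plain shift works by position, while grouping on the index level shifts within each player.

```python
import pandas as pd

df_small = pd.DataFrame({
    "player": ["A.J..GREER", "ZENON.KONOPKA", "A.J..GREER", "ZENON.KONOPKA"],
    "game_order": [37, 34, 36, 37],
    "TOI_comp": [16.566667, 12.666666, 11.733333, 13.111111],
}).set_index(["player", "game_order"])

# Purely positional: each row receives the value of the row above it,
# even when that row belongs to a different player.
positional = df_small["TOI_comp"].shift()

# Grouping on the index level makes shift operate within each player.
per_player = df_small.groupby(level="player")["TOI_comp"].shift()

print(pd.DataFrame({"positional": positional, "per_player": per_player}))
```

In the positional column the first ZENON.KONOPKA row picks up A.J..GREER's value, whereas in the per_player column it stays NaN.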

As for your second question: I think looping should work like this, provided the column names of your dataframe are OK:

group_obj = df2.groupby('player')
for col in ['TOI_comp', 'G_comp']:
    # transform returns a Series aligned to df2's index, one value per row
    df2[f'{col}_lt'] = group_obj[col].transform(lambda ser: ser.rolling(2).mean().shift())

This assumes you want to apply the rolling mean in the same way to every column in the list.
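With around 100 columns, it may also be worth computing all lagged averages in one call rather than column by column. The following is only a sketch under a few assumptions: the frame has a regular player column (not an index level), a larger game_order means an older game (as described in the comments), and the names cols, lagged and out are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "player": ["A.J..GREER"] * 4 + ["ZENON.KONOPKA"] * 3,
    "game_order": [37, 36, 35, 34, 37, 35, 34],
    "TOI_comp": [16.566667, 11.733333, 12.700000, 15.433333,
                 13.111111, 14.722222, 12.666666],
    "G_comp": [0, 0, 0, 0, 0, 0, 0],
})

# Oldest game first within each player (assumes larger game_order = older game).
df2 = df.sort_values(["player", "game_order"], ascending=[True, False])

cols = ["TOI_comp", "G_comp"]  # in practice, the full list of columns to lag

# One grouped transform handles every selected column; the result keeps
# df2's row index, so it can be joined straight back onto the frame.
lagged = (
    df2.groupby("player")[cols]
       .transform(lambda ser: ser.rolling(2).mean().shift())
       .add_suffix("_lt")
)
out = df2.join(lagged)
print(out)
```

Joining once also avoids inserting columns into the frame one at a time, which pandas can warn about when done many times.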

10 Comments

I forgot to include that I had done player_avgs_base.sort_values(by=['player','game_order'],ascending=False, inplace=True) at the top; I'm adding that now. Fully concur that the problem is that it doesn't 'respect my grouping' (which is an excellent way to phrase it, btw). Going to try the lambda part.
Thank you for the info. Is game_order unique per game_id, so that each game_order identifies exactly one game? I changed the order accordingly above.
game_order is unique per player. So for each player there's a game 1, which is the most recent, up to game ___, representing their total games played. The function is meant to generate the average over the last two games. Currently I'm trying the apply(lambda ser: ser.rolling(2).mean().shift())
Also, on the frame itself I tried player_avgs_base.groupby('player').apply(lambda _df: _df.sort_values(by=['game_order']).TOI_comp.rolling(2).mean().shift()) which seemed to work (verifying currently). Adding it to the loop failed. Again, the challenge is that the loop is needed since I'm generating this across some 100 columns, so writing the function out 100 times is very un-pythonic.
Thank you. It sounds like you have multiple columns with the same name. Have you already checked the output of df.dtypes? If the columns are OK, you can try the code above (see the last few lines) in case you want to apply the same logic to a list of columns.