Most efficient way to do rolling window on a datetimeindex with a data offset from the index

Question

I am trying to calculate statistics over a shifted/offset rolling window of an inconsistent datetimeindex of a dataset in a pandas dataframe. I want to bring these statistics back to the current datetimeindex. I have a solution but it is computationally inefficient and impractical to run over my large dataset of millions of rows.

Here is a sample of what I want and my method to achieve it.

import pandas as pd

df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
                   index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
                                           '2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))

df:   
         Index             Col1
    2022-05-25 00:20:00.930    10
    2022-05-25 00:20:01.257    15
    2022-05-25 00:20:01.673    20
    2022-05-25 00:20:03.125    25
    2022-05-25 00:20:04.190    30
    2022-05-25 00:20:04.555    35
    2022-05-25 00:20:04.923    40
    2022-05-25 00:20:05.773    45
    2022-05-25 00:20:05.989    50
    2022-05-25 00:20:06.224    55

With the above dataset, this is my method to get a shifted rolling window at each index.

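import datetime

df['Col1 Avg'] = 0.0

# The offset and window width are constant, so define them once.
offset_t = datetime.timedelta(seconds=1.5)
window_t = datetime.timedelta(seconds=1)

for row in df.index:
    # Window is centred 1.5 s before the current timestamp, 1 s either side.
    beg = row - offset_t - window_t
    end = row - offset_t + window_t

    # Assign through a single .loc call to avoid chained assignment.
    df.loc[row, 'Col1 Avg'] = df['Col1'].loc[beg:end].mean()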

df:
              Index           Col1  Col1 Avg
    2022-05-25 00:20:00.930    10      NaN
    2022-05-25 00:20:01.257    15      NaN
    2022-05-25 00:20:01.673    20     10.0
    2022-05-25 00:20:03.125    25     15.0
    2022-05-25 00:20:04.190    30     25.0
    2022-05-25 00:20:04.555    35     25.0
    2022-05-25 00:20:04.923    40     27.5
    2022-05-25 00:20:05.773    45     35.0
    2022-05-25 00:20:05.989    50     35.0
    2022-05-25 00:20:06.224    55     35.0

Is there a way to do this more efficiently? This takes ~5 minutes for just 10,0000 rows whereas a standard rolling window is <0.05 seconds.

Something like this seems like it should work but doesn't (I think) because of the inconsistent datetimeindex entries.

df['shifted avg'] = df['Col1'].shift(-1,freq=offset_t).rolling('2s').mean()

df:

              Index           Col1  Col1 Avg  shifted avg
    2022-05-25 00:20:00.930    10      NaN        NaN
    2022-05-25 00:20:01.257    15      NaN        NaN
    2022-05-25 00:20:01.673    20     10.0        NaN
    2022-05-25 00:20:03.125    25     15.0        NaN
    2022-05-25 00:20:04.190    30     25.0        NaN
    2022-05-25 00:20:04.555    35     25.0        NaN
    2022-05-25 00:20:04.923    40     27.5        NaN
    2022-05-25 00:20:05.773    45     35.0        NaN
    2022-05-25 00:20:05.989    50     35.0        NaN
    2022-05-25 00:20:06.224    55     35.0        NaN

Chris · Accepted Answer · 2022-09-13 14:35:34Z

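If you resample to 1 ms, you can take a 2-second rolling mean and then shift the result by 500 rows, which on a 1 ms grid is a 500 ms offset. The value at time t then covers roughly t - 2.5 s to t - 0.5 s. Because the grid has a record for every millisecond, and that is the resolution of your original index, you can merge the result back onto the original timestamps to get the correct answers.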

import pandas as pd

df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
                   index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
                                           '2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))


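df = df.merge(df.resample('1ms')   # regular 1 ms grid; empty bins become NaN
                .min()
                .rolling('2s')     # trailing 2-second window; NaNs are skipped
                .mean()
                .shift(500)        # 500 rows x 1 ms = 500 ms offset
                .rename(columns={'Col1':'Col1 Avg'}),
              left_index=True, 
              right_index=True)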

print(df)

Output

                        Col1  Col1 Avg
2022-05-25 00:20:00.930    10       NaN
2022-05-25 00:20:01.257    15       NaN
2022-05-25 00:20:01.673    20      10.0
2022-05-25 00:20:03.125    25      15.0
2022-05-25 00:20:04.190    30      25.0
2022-05-25 00:20:04.555    35      25.0
2022-05-25 00:20:04.923    40      27.5
2022-05-25 00:20:05.773    45      35.0
2022-05-25 00:20:05.989    50      35.0
2022-05-25 00:20:06.224    55      35.0
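
One caveat with the 1 ms grid is that its size depends on the time span rather than on the row count (one day of data is 86.4 million grid rows), so it may become heavy for long recordings. As an alternative sketch, not part of the answer above, the same windows can be located directly on the irregular index with np.searchsorted and summed through a cumulative sum. The function name offset_window_mean and its parameters are hypothetical. The sketch assumes a sorted DatetimeIndex and no NaN values in the column, and it uses the question's inclusive window [t - 2.5 s, t - 0.5 s].

```python
import numpy as np
import pandas as pd


def offset_window_mean(series, offset, half_width):
    """Mean over [t - offset - half_width, t - offset + half_width] for each t.

    Hypothetical helper: assumes a sorted DatetimeIndex and no NaNs in `series`.
    """
    times = series.index.values                      # datetime64 array
    values = series.to_numpy(dtype=float)
    offset = pd.Timedelta(offset).to_timedelta64()
    half_width = pd.Timedelta(half_width).to_timedelta64()

    # Positions of the first and one-past-last sample inside each window
    # (both window edges inclusive, like .loc[beg:end] in the question).
    lo = np.searchsorted(times, times - offset - half_width, side="left")
    hi = np.searchsorted(times, times - offset + half_width, side="right")

    # Prefix sums turn every window sum into a difference of two lookups.
    csum = np.concatenate(([0.0], np.cumsum(values)))
    sums = csum[hi] - csum[lo]
    counts = hi - lo

    # Empty windows stay NaN.
    means = np.full(len(values), np.nan)
    np.divide(sums, counts, out=means, where=counts > 0)
    return pd.Series(means, index=series.index)


df = pd.DataFrame(
    {'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
    index=pd.DatetimeIndex([
        '2022-05-25T00:20:00.930', '2022-05-25T00:20:01.257',
        '2022-05-25T00:20:01.673', '2022-05-25T00:20:03.125',
        '2022-05-25T00:20:04.190', '2022-05-25T00:20:04.555',
        '2022-05-25T00:20:04.923', '2022-05-25T00:20:05.773',
        '2022-05-25T00:20:05.989', '2022-05-25T00:20:06.224',
    ]),
)

df['Col1 Avg'] = offset_window_mean(
    df['Col1'],
    offset=pd.Timedelta(seconds=1.5),
    half_width=pd.Timedelta(seconds=1),
)
print(df)
```

On the sample data this reproduces the Col1 Avg column shown above. Because the sums come from differences of a running total, very long float columns may accumulate small rounding error, so it is worth spot-checking a few rows against the loop from the question.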

3 Comments

Jiftim Over a year ago

This was perfect for my needs. Thanks for the quick response!

Jiftim Over a year ago

Do you know if there is a way to use .agg() in place of mean() to do multiple statistics and rename each column?

Jiftim Over a year ago

I figured out how to use .agg() with rolling, in case anyone else runs across this post:

df = df.merge(df.resample('1ms').last().rolling('2s').agg({'Col1':['mean','std']}).droplevel(0,axis=1).shift(500).rename(columns={'mean':'Col1 mean','std':'Col1 std'}), left_index=True, right_index=True)
