1

I am trying to calculate statistics over a shifted/offset rolling window of an inconsistent datetimeindex of a dataset in a pandas dataframe. I want to bring these statistics back to the current datetimeindex. I have a solution but it is computationally inefficient and impractical to run over my large dataset of millions of rows.

Here is a sample of what I want and my method to achieve it.

df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
                   index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
                                           '2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))

df:   
         Index             Col1
    2022-05-25 00:20:00.930    10
    2022-05-25 00:20:01.257    15
    2022-05-25 00:20:01.673    20
    2022-05-25 00:20:03.125    25
    2022-05-25 00:20:04.190    30
    2022-05-25 00:20:04.555    35
    2022-05-25 00:20:04.923    40
    2022-05-25 00:20:05.773    45
    2022-05-25 00:20:05.989    50
    2022-05-25 00:20:06.224    55

With the above dataset, this is my method to get a shifted rolling window at each index.

df['Col1 Avg'] = 0.0

for row in df.index:
   
    offset_t = datetime.timedelta(seconds=1.5)
    window_t = datetime.timedelta(seconds=1)
    beg = row-offset_t-window_t
    end = row-offset_t+window_t
    
    df['Col1 Avg'].loc[row:row] = df['Col1'].loc[beg:end].mean()

df:
              Index           Col1  Col1 Avg
    2022-05-25 00:20:00.930    10      NaN
    2022-05-25 00:20:01.257    15      NaN
    2022-05-25 00:20:01.673    20     10.0
    2022-05-25 00:20:03.125    25     15.0
    2022-05-25 00:20:04.190    30     25.0
    2022-05-25 00:20:04.555    35     25.0
    2022-05-25 00:20:04.923    40     27.5
    2022-05-25 00:20:05.773    45     35.0
    2022-05-25 00:20:05.989    50     35.0
    2022-05-25 00:20:06.224    55     35.0

Is there a way to do this more efficiently? This takes ~5 minutes for just 10,0000 rows whereas a standard rolling window is <0.05 seconds.

Something like this seems like it should work but doesn't (I think) because of the inconsistent datetimeindex entries.

df['shifted avg'] = df['Col1'].shift(-1,freq=offset_t).rolling('2s').mean()

df:

          Index            Col1 Col1 Avg    shifted avg
2022-05-25 00:20:00.930    10     NaN            NaN
2022-05-25 00:20:01.257    15     NaN            NaN
2022-05-25 00:20:01.673    20    10.0            NaN
2022-05-25 00:20:03.125    25    15.0            NaN
2022-05-25 00:20:04.190    30    25.0            NaN
2022-05-25 00:20:04.555    35    25.0            NaN
2022-05-25 00:20:04.923    40    27.5            NaN
2022-05-25 00:20:05.773    45    35.0            NaN
2022-05-25 00:20:05.989    50    35.0            NaN
2022-05-25 00:20:06.224    55    35.0            NaN

1 Answer 1

0

If you resample to 1ms, you can then take a 2 second rolling and offset by 500ms. Then since you have a record for every ms, and that's the resolution of your original index, you can merge them together to get the correct answers.

import pandas as pd

df = pd.DataFrame({'Col1': [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]},
                   index=pd.DatetimeIndex(['2022-05-25T00:20:00.930','2022-05-25T00:20:01.257','2022-05-25T00:20:01.673','2022-05-25T00:20:03.125','2022-05-25T00:20:04.190',
                                           '2022-05-25T00:20:04.555','2022-05-25T00:20:04.923','2022-05-25T00:20:05.773','2022-05-25T00:20:05.989','2022-05-25T00:20:06.224'],yearfirst=True))


df = df.merge(df.resample('1ms')
                .min()
                .rolling('2S')
                .mean()
                .shift(500)
                .rename(columns={'Col1':'Col1 Avg'}),
              left_index=True, 
              right_index=True)

print(df)

Output

                        Col1  Col1 Avg
2022-05-25 00:20:00.930    10       NaN
2022-05-25 00:20:01.257    15       NaN
2022-05-25 00:20:01.673    20      10.0
2022-05-25 00:20:03.125    25      15.0
2022-05-25 00:20:04.190    30      25.0
2022-05-25 00:20:04.555    35      25.0
2022-05-25 00:20:04.923    40      27.5
2022-05-25 00:20:05.773    45      35.0
2022-05-25 00:20:05.989    50      35.0
2022-05-25 00:20:06.224    55      35.0
Sign up to request clarification or add additional context in comments.

3 Comments

This was perfect for my needs. Thanks for the quick response!
Do you know if there is a way to use .agg() in place of mean() to do multiple statistics and rename each column?
I figured out how to use .agg() with rolling incase anyone else runs across this post: df = df.merge(resampled_df = df.resample('1ms').last().rolling('2S').agg({'Col1':['mean','std']}).droplevel(0,axis=1).shift(500).rename(columns={'mean':'Col1 mean','std':'Col1 std'}), left_index=True, right_index=True)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.