Pandas fill in missing date within each group with information in the previous row

Question

Similar question to this one, but with some modifications:

Instead of filling in missing dates for each group between the min and max date of the entire column, we only should be filling in the dates between the min and the max of that group, and output a dataframe with the last row in each group

Reproducible example:

x = pd.DataFrame({'dt': ['2016-01-01','2016-01-03', '2016-01-04','2016-01-01','2016-01-01','2016-01-04']
                    ,'amount': [10.0,30.0,40.0,78.0,80.0,82.0]
                    , 'sub_id': [1,1,1,2,2,2]
                    })

Visually:

            dt   sub_id   amount
0   2016-01-01        1     10.0
1   2016-01-03        1     30.0
2   2016-01-04        1     40.0
3   2017-01-01        2     78.0
4   2017-01-01        2     80.0
5   2017-01-04        2     82.0

Output I need:

            dt   sub_id   amount
0   2016-01-01        1     10.0
1   2016-01-02        1     10.0
2   2016-01-03        1     30.0
3   2016-01-04        1     40.0
4   2017-01-01        2     80.0
5   2017-01-02        2     80.0
6   2017-01-03        2     80.0
7   2017-01-04        2     82.0

We are grouping by dt and sub_id. As you can see, in sub_id=1, a row was added for 2016-01-02 and amount was imputed at 10.0 as the previous row was 10.0 (Assume data is sorted beforehand to enable this). For sub_id=2 row was added for 2017-01-02 and 2017-01-03 and amount is 80.0 as that was the last row before this date. The first row for 2017-01-01 was also deleted because we just want to keep the last row for each date and sub_id.

Looking for the most efficient way to do this as the real data has millions of rows. I have a current method using lambda functions and applying them across groups of sub_id but I feel like we could do better.

Thanks!

piRSquared · Accepted Answer · 2018-05-21 23:54:27Z

4

Getting the date right of course:

x.dt = pd.to_datetime(x.dt)

Then this:

cols = ['dt', 'sub_id']

pd.concat([
    d.asfreq('D').ffill(downcast='infer')
    for _, d in x.drop_duplicates(cols, keep='last')
                 .set_index('dt').groupby('sub_id')
]).reset_index()

          dt  amount  sub_id
0 2016-01-01      10       1
1 2016-01-02      10       1
2 2016-01-03      30       1
3 2016-01-04      40       1
4 2016-01-01      80       2
5 2016-01-02      80       2
6 2016-01-03      80       2
7 2016-01-04      82       2

edited May 21, 2018 at 23:54

answered May 21, 2018 at 23:35

piRSquared

296k68 gold badges509 silver badges654 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

AZhao Over a year ago

note this answer requires the data to be sorted by date to start!

BENY · Accepted Answer · 2018-05-21 23:09:49Z

4

By using resample with groupby

x.dt=pd.to_datetime(x.dt)
x.set_index('dt').groupby('sub_id').apply(lambda x : x.resample('D').max().ffill()).reset_index(level=1)
Out[265]: 
               dt  amount  sub_id
sub_id                           
1      2016-01-01    10.0     1.0
1      2016-01-02    10.0     1.0
1      2016-01-03    30.0     1.0
1      2016-01-04    40.0     1.0
2      2016-01-01    80.0     2.0
2      2016-01-02    80.0     2.0
2      2016-01-03    80.0     2.0
2      2016-01-04    82.0     2.0

answered May 21, 2018 at 23:09

BENY

324k22 gold badges176 silver badges250 bronze badges

2 Comments

user4505419 Over a year ago

Thanks man, this is very clean but I had a similar solution and because it uses apply, performance is quite slow. piRSquared' s solution worked best for me

user4505419 Over a year ago

I take above comment back. The performance difference between this and other solutions is due to the resample('D').max() step. Using drop duplicates first and then doing asfreq('D') is much faster

Haleemur Ali · Accepted Answer · 2018-05-21 23:49:49Z

use asfreq & groupby

first convert dt to datetime & get rid of duplicates

then for each group of sub_id use asfreq('D', method='ffill') to generate missing dates and impute amounts

finally reset_index on amount column as there's a duplicate sub_id column as well as index.

x.dt = pd.to_datetime(x.dt)
x.drop_duplicates(
  ['dt', 'sub_id'], 'last'
).groupby('sub_id').apply(
  lambda x: x.set_index('dt').asfreq('D', method='ffill')
).amount.reset_index()

# output:

   sub_id         dt  amount
0       1 2016-01-01    10.0
1       1 2016-01-02    10.0
2       1 2016-01-03    30.0
3       1 2016-01-04    40.0
4       2 2016-01-01    80.0
5       2 2016-01-02    80.0
6       2 2016-01-03    80.0
7       2 2016-01-04    82.0

Ido S · Accepted Answer · 2018-05-21 23:41:25Z

The below works for me and seems pretty efficient, but I can't say if it's efficient enough. It does avoid lambdas tho.

I called your data df.

Create a base_df with the entire date / sub_id grid:

import pandas as pd
from itertools import product

base_grid = product(pd.date_range(df['dt'].min(), df['dt'].max(), freq='D'), list(range(df['sub_id'].min(), df['sub_id'].max() + 1, 1)))

base_df = pd.DataFrame(list(base_grid), columns=['dt', 'sub_id'])

Get the max value per dt / sub_id from df:

max_value_df = df.loc[df.groupby(['dt', 'sub_id'])['amount'].idxmax()]
max_value_df['dt']  = max_value_df['dt'].apply(pd.Timestamp)

Merge base_df on the max values:

merged_df = base_df.merge(max_value_df, how='left', on=['dt', 'sub_id'])

Sort and forward fill the maximal value:

merged_df = merged_df.sort_values(by=['sub_id', 'dt', 'amount'], ascending=True)
merged_df['amount'] = merged_df.groupby(['sub_id'])['amount'].fillna(method='ffill')

Result:

    dt  sub_id  amount
0   2016-01-01  1   10.0
2   2016-01-02  1   10.0
4   2016-01-03  1   30.0
6   2016-01-04  1   40.0
1   2016-01-01  2   80.0
3   2016-01-02  2   80.0
5   2016-01-03  2   80.0
7   2016-01-04  2   82.0

Collectives™ on Stack Overflow

Pandas fill in missing date within each group with information in the previous row

4 Answers 4

1 Comment

2 Comments

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

4 Answers 4

1 Comment

2 Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related