Pandas filling missing dates and values within group

Question

I've a data frame that looks like the following

x = pd.DataFrame({'user': ['a','a','b','b'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06'], 'val': [1,33,2,1]})

What I would like to be able to do is find the minimum and maximum date within the date column and expand that column to have all the dates there while simultaneously filling in 0 for the val column. So the desired output is

            dt user  val
0   2016-01-01    a    1
1   2016-01-02    a   33
2   2016-01-03    a    0
3   2016-01-04    a    0
4   2016-01-05    a    0
5   2016-01-06    a    0
6   2016-01-01    b    0
7   2016-01-02    b    0
8   2016-01-03    b    0
9   2016-01-04    b    0
10  2016-01-05    b    2
11  2016-01-06    b    1

I've tried the solution mentioned here and here but they aren't what I'm after. Any pointers much appreciated.

C. Braun · Accepted Answer · 2019-04-05 14:38:35Z

49

Initial Dataframe:

            dt  user    val
0   2016-01-01     a      1
1   2016-01-02     a     33
2   2016-01-05     b      2
3   2016-01-06     b      1

First, convert the dates to datetime:

x['dt'] = pd.to_datetime(x['dt'])

Then, generate the dates and unique users:

dates = x.set_index('dt').resample('D').asfreq().index

>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', name='dt', freq='D')

users = x['user'].unique()

>> array(['a', 'b'], dtype=object)

This will allow you to create a MultiIndex:

idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])

>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['dt', 'user'])

You can use that to reindex your DataFrame:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out: 
           dt user  val
0  2016-01-01    a    1
1  2016-01-01    b    0
2  2016-01-02    a   33
3  2016-01-02    b    0
4  2016-01-03    a    0
5  2016-01-03    b    0
6  2016-01-04    a    0
7  2016-01-04    b    0
8  2016-01-05    a    0
9  2016-01-05    b    2
10 2016-01-06    a    0
11 2016-01-06    b    1

which then can be sorted by users:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out: 
           dt user  val
0  2016-01-01    a    1
2  2016-01-02    a   33
4  2016-01-03    a    0
6  2016-01-04    a    0
8  2016-01-05    a    0
10 2016-01-06    a    0
1  2016-01-01    b    0
3  2016-01-02    b    0
5  2016-01-03    b    0
7  2016-01-04    b    0
9  2016-01-05    b    2
11 2016-01-06    b    1

edited Apr 5, 2019 at 14:38

C. Braun

5,2611 gold badge22 silver badges51 bronze badges

answered Jul 7, 2017 at 19:29

user2285236

Sign up to request clarification or add additional context in comments.

6 Comments

luoshao23 Over a year ago

In this case, all user share the same time range. What if each user has different date range [min, max]?

user2285236 Over a year ago

@luoshao23 .MultiIndex.from_product is a convenience function for the cartesian product but the index can also be constructed in other ways too. What kind of data do you have for this? How do you store min and max for each user for example? Does it come from a DataFrame?

luoshao23 Over a year ago

@ayhan Yes, I make a date_range using min/max value, and then use .MultiIndex.from_product to generate multi level index for each user. Finally, I concat all the DataFrames to one DataFrame. It works while it takes much time. Do you have any idea to improve the efficiency?

user2285236 Over a year ago

@luoshao23 If all of them are coming from a single DataFrame, generating the multiindex separately and using reindex would probably more efficient. You can use MultiIndex.from_tuples for example. I think it would be better if you post a new question with a sample DataFrame.

user2458552 Over a year ago

This doesn't work when we have duplicate values of 'dt' in different groups. Ex: 2016-01-01 is present both for user- a & b. Below answer (@piRSquared) addreses for that as well. It's a one-liner with stack & unstack.

|

piRSquared · Accepted Answer · 2017-07-07 21:17:46Z

46

As @ayhan suggests

x.dt = pd.to_datetime(x.dt)

One-liner using mostly @ayhan's ideas while incorporating stack/unstack and fill_value

x.set_index(
    ['dt', 'user']
).unstack(
    fill_value=0
).asfreq(
    'D', fill_value=0
).stack().sort_index(level=1).reset_index()

           dt user  val
0  2016-01-01    a    1
1  2016-01-02    a   33
2  2016-01-03    a    0
3  2016-01-04    a    0
4  2016-01-05    a    0
5  2016-01-06    a    0
6  2016-01-01    b    0
7  2016-01-02    b    0
8  2016-01-03    b    0
9  2016-01-04    b    0
10 2016-01-05    b    2
11 2016-01-06    b    1

edited Jul 7, 2017 at 21:17

answered Jul 7, 2017 at 21:11

piRSquared

296k68 gold badges509 silver badges654 bronze badges

2 Comments

user131476 Over a year ago

This solution is better than the accepted answer. It works even if there is a duplicate date across user. For example, if there was another row with "2016-01-01" and user "b", accepted solution would fail with an error - cannot reindex from a duplicate axis. But this solution works for such scenario too

Vivek Payasi Over a year ago

How will this change if we need to do it using the min/max dates within the group ('user' in this case)? Basically generating/filling continuous dates within the group only.

sammywemmy · Accepted Answer · 2022-03-12 03:49:41Z

An old question, with already excellent answers; this is an alternative, using the complete function from pyjanitor that could help with the abstraction when generating explicitly missing rows:

#pip install pyjanitor
import pandas as pd
import janitor as jn

 x['dt'] = pd.to_datetime(x['dt'])

# generate complete list of dates
dates = dict(dt = pd.date_range(x.dt.min(), x.dt.max(), freq='1D'))

# build the new dataframe, and fill nulls with 0
x.complete('user', dates, fill_value = 0)

   user         dt  val
0     a 2016-01-01    1
1     a 2016-01-02   33
2     a 2016-01-03    0
3     a 2016-01-04    0
4     a 2016-01-05    0
5     a 2016-01-06    0
6     b 2016-01-01    0
7     b 2016-01-02    0
8     b 2016-01-03    0
9     b 2016-01-04    0
10    b 2016-01-05    2
11    b 2016-01-06    1

Collectives™ on Stack Overflow

Pandas filling missing dates and values within group

3 Answers 3

6 Comments

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

6 Comments

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related