48

I've a data frame that looks like the following

x = pd.DataFrame({'user': ['a','a','b','b'], 'dt': ['2016-01-01','2016-01-02', '2016-01-05','2016-01-06'], 'val': [1,33,2,1]})

What I would like to be able to do is find the minimum and maximum date within the date column and expand that column to have all the dates there while simultaneously filling in 0 for the val column. So the desired output is

            dt user  val
0   2016-01-01    a    1
1   2016-01-02    a   33
2   2016-01-03    a    0
3   2016-01-04    a    0
4   2016-01-05    a    0
5   2016-01-06    a    0
6   2016-01-01    b    0
7   2016-01-02    b    0
8   2016-01-03    b    0
9   2016-01-04    b    0
10  2016-01-05    b    2
11  2016-01-06    b    1

I've tried the solution mentioned here and here but they aren't what I'm after. Any pointers much appreciated.

0

3 Answers 3

49

Initial Dataframe:

            dt  user    val
0   2016-01-01     a      1
1   2016-01-02     a     33
2   2016-01-05     b      2
3   2016-01-06     b      1

First, convert the dates to datetime:

x['dt'] = pd.to_datetime(x['dt'])

Then, generate the dates and unique users:

dates = x.set_index('dt').resample('D').asfreq().index

>> DatetimeIndex(['2016-01-01', '2016-01-02', '2016-01-03', '2016-01-04',
               '2016-01-05', '2016-01-06'],
              dtype='datetime64[ns]', name='dt', freq='D')

users = x['user'].unique()

>> array(['a', 'b'], dtype=object)

This will allow you to create a MultiIndex:

idx = pd.MultiIndex.from_product((dates, users), names=['dt', 'user'])

>> MultiIndex(levels=[[2016-01-01 00:00:00, 2016-01-02 00:00:00, 2016-01-03 00:00:00, 2016-01-04 00:00:00, 2016-01-05 00:00:00, 2016-01-06 00:00:00], ['a', 'b']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5], [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]],
           names=['dt', 'user'])

You can use that to reindex your DataFrame:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index()
Out: 
           dt user  val
0  2016-01-01    a    1
1  2016-01-01    b    0
2  2016-01-02    a   33
3  2016-01-02    b    0
4  2016-01-03    a    0
5  2016-01-03    b    0
6  2016-01-04    a    0
7  2016-01-04    b    0
8  2016-01-05    a    0
9  2016-01-05    b    2
10 2016-01-06    a    0
11 2016-01-06    b    1

which then can be sorted by users:

x.set_index(['dt', 'user']).reindex(idx, fill_value=0).reset_index().sort_values(by='user')
Out: 
           dt user  val
0  2016-01-01    a    1
2  2016-01-02    a   33
4  2016-01-03    a    0
6  2016-01-04    a    0
8  2016-01-05    a    0
10 2016-01-06    a    0
1  2016-01-01    b    0
3  2016-01-02    b    0
5  2016-01-03    b    0
7  2016-01-04    b    0
9  2016-01-05    b    2
11 2016-01-06    b    1
Sign up to request clarification or add additional context in comments.

6 Comments

In this case, all user share the same time range. What if each user has different date range [min, max]?
@luoshao23 .MultiIndex.from_product is a convenience function for the cartesian product but the index can also be constructed in other ways too. What kind of data do you have for this? How do you store min and max for each user for example? Does it come from a DataFrame?
@ayhan Yes, I make a date_range using min/max value, and then use .MultiIndex.from_product to generate multi level index for each user. Finally, I concat all the DataFrames to one DataFrame. It works while it takes much time. Do you have any idea to improve the efficiency?
@luoshao23 If all of them are coming from a single DataFrame, generating the multiindex separately and using reindex would probably more efficient. You can use MultiIndex.from_tuples for example. I think it would be better if you post a new question with a sample DataFrame.
This doesn't work when we have duplicate values of 'dt' in different groups. Ex: 2016-01-01 is present both for user- a & b. Below answer (@piRSquared) addreses for that as well. It's a one-liner with stack & unstack.
|
46

As @ayhan suggests

x.dt = pd.to_datetime(x.dt)

One-liner using mostly @ayhan's ideas while incorporating stack/unstack and fill_value

x.set_index(
    ['dt', 'user']
).unstack(
    fill_value=0
).asfreq(
    'D', fill_value=0
).stack().sort_index(level=1).reset_index()

           dt user  val
0  2016-01-01    a    1
1  2016-01-02    a   33
2  2016-01-03    a    0
3  2016-01-04    a    0
4  2016-01-05    a    0
5  2016-01-06    a    0
6  2016-01-01    b    0
7  2016-01-02    b    0
8  2016-01-03    b    0
9  2016-01-04    b    0
10 2016-01-05    b    2
11 2016-01-06    b    1

2 Comments

This solution is better than the accepted answer. It works even if there is a duplicate date across user. For example, if there was another row with "2016-01-01" and user "b", accepted solution would fail with an error - cannot reindex from a duplicate axis. But this solution works for such scenario too
How will this change if we need to do it using the min/max dates within the group ('user' in this case)? Basically generating/filling continuous dates within the group only.
2

An old question, with already excellent answers; this is an alternative, using the complete function from pyjanitor that could help with the abstraction when generating explicitly missing rows:

#pip install pyjanitor
import pandas as pd
import janitor as jn

 x['dt'] = pd.to_datetime(x['dt'])

# generate complete list of dates
dates = dict(dt = pd.date_range(x.dt.min(), x.dt.max(), freq='1D'))

# build the new dataframe, and fill nulls with 0
x.complete('user', dates, fill_value = 0)

   user         dt  val
0     a 2016-01-01    1
1     a 2016-01-02   33
2     a 2016-01-03    0
3     a 2016-01-04    0
4     a 2016-01-05    0
5     a 2016-01-06    0
6     b 2016-01-01    0
7     b 2016-01-02    0
8     b 2016-01-03    0
9     b 2016-01-04    0
10    b 2016-01-05    2
11    b 2016-01-06    1

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.