    uuid  site          ts  visit
0  +CW99  1124  2013-06-24     2
1  +CW99  1124  2013-06-26     1
2  +CW99  1124  2013-06-27     1
3  +CW99  1124  2013-06-20     1
4  +CW99  1124  2013-06-21     1
5  +CW99  1124  2013-06-24     2
6  +CW9W   956  2013-06-21     4
7  +CW9W   956  2013-06-22     2
8  +CW9W   956  2013-06-23     3
9  +CW9W   956  2013-06-24     4

I have a dataframe passed in that looks like the one above. I don't know in advance how many distinct ts periods there are, so I want to find the min and max of ts and then expand the rows so that every group covers all the (daily) ts values in that range, zero-filling the visit column and copying the rest of the fields:

        uuid  site          ts  visit
    0  +CW99  1124  2013-06-20     1
    1  +CW99  1124  2013-06-21     1
    2  +CW99  1124  2013-06-22     1
    3  +CW99  1124  2013-06-23     0
    4  +CW99  1124  2013-06-24     2
    5  +CW99  1124  2013-06-25     0
    6  +CW99  1124  2013-06-26     1
    7  +CW99  1124  2013-06-27     1
    8  +CW9W   956  2013-06-20     0
    9  +CW9W   956  2013-06-21     4
   10  +CW9W   956  2013-06-22     2
   11  +CW9W   956  2013-06-23     3
   12  +CW9W   956  2013-06-24     4
   13  +CW9W   956  2013-06-25     0
   14  +CW9W   956  2013-06-26     0
   15  +CW9W   956  2013-06-27     0

Thanks.

1 Answer


This is non-trivial; I'll explain why below.

As a preliminary step, read in the original DataFrame and make sure that the ts column has dtype datetime64[ns]:

import pandas as pd

# you may need to do this to get the correct dtype
df['ts'] = pd.to_datetime(df['ts'])

In [107]: df
Out[107]: 
    uuid  site                  ts  visit
0  +CW99  1124 2013-06-24 00:00:00      2
1  +CW99  1124 2013-06-26 00:00:00      1
2  +CW99  1124 2013-06-27 00:00:00      1
3  +CW99  1124 2013-06-20 00:00:00      1
4  +CW99  1124 2013-06-21 00:00:00      1
5  +CW99  1124 2013-06-24 00:00:00      2
6  +CW9W   956 2013-06-21 00:00:00      4
7  +CW9W   956 2013-06-22 00:00:00      2
8  +CW9W   956 2013-06-23 00:00:00      3
9  +CW9W   956 2013-06-24 00:00:00      4

In [106]: df.dtypes
Out[106]: 
uuid             object
site              int64
ts       datetime64[ns]
visit             int64
dtype: object

Create the master series of daily timestamps between the min and the max:

In [110]: all_ts = pd.date_range(df['ts'].min(),df['ts'].max())

In [111]: all_ts
Out[111]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-06-20 00:00:00, ..., 2013-06-27 00:00:00]
Length: 8, Freq: D, Timezone: None
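
Daily is date_range's default frequency; if you want to be explicit (a small aside, not part of the original answer):

# 'D' is the default frequency for date_range, but you can spell it out
all_ts = pd.date_range(df['ts'].min(), df['ts'].max(), freq='D')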

Define a function like this:

In [103]: def f(x):
   .....:     # we want all of the ``ts`` values that are not in the master time series
   .....:     adf = DataFrame(dict(ts = all_ts - Index(x['ts'])), columns=df.columns)
   .....:     # the added rows should have a visit of 0
   .....:     adf['visit'] = 0
   .....:     # first append them to the frame (x), ignoring the index,
   .....:     # sort by the ts column,
   .....:     # then forward-fill the missing values
   .....:     return x.append(adf, ignore_index=True).sort_index(by='ts').ffill()
   .....:
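
As an aside, this answer predates some API changes: DataFrame.append was later removed and sort_index(by=...) became sort_values. Here is a minimal sketch of the same function against the modern API, assuming the df and all_ts defined above (f_modern is just an illustrative name):

import pandas as pd

def f_modern(x):
    # dates in the master range that this group is missing
    missing_ts = all_ts.difference(pd.Index(x['ts']))
    # filler rows with the same columns as df, and zero visits
    adf = pd.DataFrame({'ts': missing_ts}, columns=df.columns)
    adf['visit'] = 0
    # pd.concat replaces the removed append; sort_values replaces sort_index(by=...)
    return pd.concat([x, adf], ignore_index=True).sort_values('ts').ffill()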

Apply the function (you could group by ['uuid','site'] if you want as well):

In [116]: df.groupby('uuid').apply(f)
Out[116]: 
          uuid  site                  ts  visit
uuid                                           
+CW99 3  +CW99  1124 2013-06-20 00:00:00      1
      4  +CW99  1124 2013-06-21 00:00:00      1
      0  +CW99  1124 2013-06-24 00:00:00      2
      5  +CW99  1124 2013-06-24 00:00:00      2
      6  +CW99  1124 2013-06-25 00:00:00      0
      1  +CW99  1124 2013-06-26 00:00:00      1
      2  +CW99  1124 2013-06-27 00:00:00      1
+CW9W 0  +CW9W   956 2013-06-21 00:00:00      4
      1  +CW9W   956 2013-06-22 00:00:00      2
      2  +CW9W   956 2013-06-23 00:00:00      3
      3  +CW9W   956 2013-06-24 00:00:00      4
      4  +CW9W   956 2013-06-25 00:00:00      0
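
If you don't want the group labels in the resulting index, a small usage note (my addition, using the same f as above): groupby accepts group_keys=False, or you can drop the index afterwards.

# flat integer index instead of the (uuid, original index) MultiIndex
result = df.groupby('uuid', group_keys=False).apply(f).reset_index(drop=True)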

Note: you have a duplicate in the posted frame. Not sure if that was intentional or not, but I preserved it. This is a somewhat easier problem if you don't have duplicates (in the ts column).

Here is the no-duplicates way:

In [207]: def f(x):
   .....:     # reindex this group onto the master daily range; missing dates become NaN rows
   .....:     x = x.set_index('ts').reindex(all_ts).reset_index()
   .....:     # zero-fill the metric, then forward-fill the identifying columns
   .....:     x['visit'] = x['visit'].fillna(0)
   .....:     return x.ffill()
   .....:

In [208]: df_no_dups.groupby('uuid').apply(f)
Out[208]: 
                      index   uuid  site  visit
uuid                                           
+CW99 0 2013-06-20 00:00:00  +CW99  1124      1
      1 2013-06-21 00:00:00  +CW99  1124      1
      2 2013-06-22 00:00:00  +CW99  1124      0
      3 2013-06-23 00:00:00  +CW99  1124      0
      4 2013-06-24 00:00:00  +CW99  1124      2
      5 2013-06-25 00:00:00  +CW99  1124      0
      6 2013-06-26 00:00:00  +CW99  1124      1
      7 2013-06-27 00:00:00  +CW99  1124      1
+CW9W 0 2013-06-20 00:00:00    NaN   NaN      0
      1 2013-06-21 00:00:00  +CW9W   956      4
      2 2013-06-22 00:00:00  +CW9W   956      2
      3 2013-06-23 00:00:00  +CW9W   956      3
      4 2013-06-24 00:00:00  +CW9W   956      4
      5 2013-06-25 00:00:00  +CW9W   956      0
      6 2013-06-26 00:00:00  +CW9W   956      0
      7 2013-06-27 00:00:00  +CW9W   956      0

This forces all the elements to be there (note the NaNs in the first +CW9W row: there is nothing earlier in the group to ffill from). You could drop these rows if you want.
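
Alternatively, a minimal sketch that fills those leading NaNs instead of dropping them; the bfill step is my addition rather than part of the answer, and it assumes each group carries a single uuid/site value:

def f_filled(x):
    x = x.set_index('ts').reindex(all_ts).reset_index()
    x['visit'] = x['visit'].fillna(0)
    # ffill fills forward from the first real row; bfill then covers rows before it
    x[['uuid', 'site']] = x[['uuid', 'site']].ffill().bfill()
    return x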


5 Comments

Thanks Jeff, the duplicate is unintentional. How would you solve this if no duplicate ts is allowed?
Added the non-dup one..... it gets a solution that includes ALL dates (but the first rows of a group have NaN values if its first date is not == the min value). FYI, if this detects dups it will raise an Exception.
It's almost what I needed, except those NaNs. Maybe I should group by uuid and site, then reset the index to get rid of the NaNs.
fillna, in the function f. You could do x['uuid'] = x['uuid'].fillna(x['uuid'].unique()[0]), for example.
You can also do the groupby by ['uuid','site'] as suggested. Lots of things to play with here.
