This is non-trivial, so I'll walk through it step by step.
First, read in the original data frame and make sure the ts column has dtype datetime64[ns]:
# you may need to do this to get the correct dtype
df['ts'] = pd.to_datetime(df['ts'])
In [107]: df
Out[107]:
uuid site ts visit
0 +CW99 1124 2013-06-24 00:00:00 2
1 +CW99 1124 2013-06-26 00:00:00 1
2 +CW99 1124 2013-06-27 00:00:00 1
3 +CW99 1124 2013-06-20 00:00:00 1
4 +CW99 1124 2013-06-21 00:00:00 1
5 +CW99 1124 2013-06-24 00:00:00 2
6 +CW9W 956 2013-06-21 00:00:00 4
7 +CW9W 956 2013-06-22 00:00:00 2
8 +CW9W 956 2013-06-23 00:00:00 3
9 +CW9W 956 2013-06-24 00:00:00 4
In [106]: df.dtypes
Out[106]:
uuid object
site int64
ts datetime64[ns]
visit int64
dtype: object
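If you want to follow along, the example frame can be reconstructed directly (values copied from the output above; this construction is my convenience sketch, not part of the original session):

```python
import pandas as pd

# rebuild the example frame shown above; pd.to_datetime gives
# the ts column the required datetime64[ns] dtype
df = pd.DataFrame({
    'uuid':  ['+CW99'] * 6 + ['+CW9W'] * 4,
    'site':  [1124] * 6 + [956] * 4,
    'ts':    pd.to_datetime(['2013-06-24', '2013-06-26', '2013-06-27',
                             '2013-06-20', '2013-06-21', '2013-06-24',
                             '2013-06-21', '2013-06-22', '2013-06-23',
                             '2013-06-24']),
    'visit': [2, 1, 1, 1, 1, 2, 4, 2, 3, 4],
})
```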
Create the master date range spanning the min and max of ts:
In [110]: all_ts = pd.date_range(df['ts'].min(),df['ts'].max())
In [111]: all_ts
Out[111]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-06-20 00:00:00, ..., 2013-06-27 00:00:00]
Length: 8, Freq: D, Timezone: None
Define a function like this
In [103]: def f(x):
   .....:     # dates from the master index that are missing
   .....:     # from this group's ``ts`` column
   .....:     adf = pd.DataFrame(dict(ts=all_ts.difference(x['ts'])),
   .....:                        columns=df.columns)
   .....:     # the missing dates should have a visit count of 0
   .....:     adf['visit'] = 0
   .....:     # append the filler rows to the group (ignoring the index),
   .....:     # sort by ts, then forward-fill the remaining missing values
   .....:     return pd.concat([x, adf], ignore_index=True).sort_values('ts').ffill()
   .....:
Apply the function (you could group by ['uuid', 'site'] as well if you want):
In [116]: df.groupby('uuid').apply(f)
Out[116]:
uuid site ts visit
uuid
+CW99 3 +CW99 1124 2013-06-20 00:00:00 1
4 +CW99 1124 2013-06-21 00:00:00 1
0 +CW99 1124 2013-06-24 00:00:00 2
5 +CW99 1124 2013-06-24 00:00:00 2
6 +CW99 1124 2013-06-25 00:00:00 0
1 +CW99 1124 2013-06-26 00:00:00 1
2 +CW99 1124 2013-06-27 00:00:00 1
+CW9W 0 +CW9W 956 2013-06-21 00:00:00 4
1 +CW9W 956 2013-06-22 00:00:00 2
2 +CW9W 956 2013-06-23 00:00:00 3
3 +CW9W 956 2013-06-24 00:00:00 4
4 +CW9W 956 2013-06-25 00:00:00 0
Note: you have a duplicate row in the posted frame. Not sure whether that was intentional, so I preserved it. This is a somewhat easier problem if the ts column has no duplicates; here is the no-duplicates way.
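The original doesn't show how df_no_dups was built; one plausible recipe (my assumption) is to keep the first of any rows sharing the same (uuid, ts) pair:

```python
import pandas as pd

# small frame with one duplicate (uuid, ts) pair, mirroring the posted data
df = pd.DataFrame({
    'uuid':  ['+CW99', '+CW99', '+CW99'],
    'site':  [1124, 1124, 1124],
    'ts':    pd.to_datetime(['2013-06-24', '2013-06-26', '2013-06-24']),
    'visit': [2, 1, 2],
})

# hypothetical construction of df_no_dups: keep the first row
# of each duplicated (uuid, ts) pair
df_no_dups = df.drop_duplicates(subset=['uuid', 'ts'], keep='first')
```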
In [207]: def f(x):
   .....:     x = x.set_index('ts').reindex(all_ts).reset_index()
   .....:     x['visit'] = x['visit'].fillna(0)
   .....:     return x.ffill()
   .....:
In [208]: df_no_dups.groupby('uuid').apply(f)
Out[208]:
index uuid site visit
uuid
+CW99 0 2013-06-20 00:00:00 +CW99 1124 1
1 2013-06-21 00:00:00 +CW99 1124 1
2 2013-06-22 00:00:00 +CW99 1124 0
3 2013-06-23 00:00:00 +CW99 1124 0
4 2013-06-24 00:00:00 +CW99 1124 2
5 2013-06-25 00:00:00 +CW99 1124 0
6 2013-06-26 00:00:00 +CW99 1124 1
7 2013-06-27 00:00:00 +CW99 1124 1
+CW9W 0 2013-06-20 00:00:00 NaN NaN 0
1 2013-06-21 00:00:00 +CW9W 956 4
2 2013-06-22 00:00:00 +CW9W 956 2
3 2013-06-23 00:00:00 +CW9W 956 3
4 2013-06-24 00:00:00 +CW9W 956 4
5 2013-06-25 00:00:00 +CW9W 956 0
6 2013-06-26 00:00:00 +CW9W 956 0
7 2013-06-27 00:00:00 +CW9W 956 0
This forces all the elements to be there (note the NaN in uuid/site because there is no earlier value to forward-fill from on a group's first element). You could drop these rows if you want.
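Dropping those unfillable leading rows can be done with dropna on one of the forward-filled columns. Here `result` is a stand-in for the frame returned by the groupby-apply (the two rows below are trimmed from the output above for illustration):

```python
import pandas as pd

# stand-in for the groupby('uuid').apply(f) result: the first row
# could not be forward-filled, so uuid/site are missing
result = pd.DataFrame({
    'index': pd.to_datetime(['2013-06-20', '2013-06-21']),
    'uuid':  [None, '+CW9W'],
    'site':  [None, 956],
    'visit': [0, 4],
})

# drop rows that ffill could not reach (no earlier value to fill from)
clean = result.dropna(subset=['site'])
```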