    uuid  site          ts  visit
0  +CW99  1124  2013-06-24     2
1  +CW99  1124  2013-06-26     1
2  +CW99  1124  2013-06-27     1
3  +CW99  1124  2013-06-20     1
4  +CW99  1124  2013-06-21     1
5  +CW99  1124  2013-06-24     2
6  +CW9W   956  2013-06-21     4
7  +CW9W   956  2013-06-22     2
8  +CW9W   956  2013-06-23     3
9  +CW9W   956  2013-06-24     4

I have a dataframe passed in that looks like the one above. I don't know in advance how many distinct ts periods there are, so I want to find the min and max of ts and then expand the rows so that every group covers all the (daily) ts values in that range, zero-filling the visit column and copying the rest of the fields:

        uuid  site          ts  visit
    0  +CW99  1124  2013-06-20     1
    1  +CW99  1124  2013-06-21     1
    2  +CW99  1124  2013-06-22     1
    3  +CW99  1124  2013-06-23     0
    4  +CW99  1124  2013-06-24     2
    5  +CW99  1124  2013-06-25     0
    6  +CW99  1124  2013-06-26     1
    7  +CW99  1124  2013-06-27     1
    8  +CW9W   956  2013-06-20     0
    9  +CW9W   956  2013-06-21     4
   10  +CW9W   956  2013-06-22     2
   11  +CW9W   956  2013-06-23     3
   12  +CW9W   956  2013-06-24     4
   13  +CW9W   956  2013-06-25     0
   14  +CW9W   956  2013-06-26     0
   15  +CW9W   956  2013-06-27     0

Thanks.

1 Answer


This is non-trivial; I'll explain why below.

As a preliminary step, read in the original DataFrame and make sure that the ts column has dtype datetime64[ns]:

import pandas as pd

# you may need to do this to get the correct dtype
df['ts'] = pd.to_datetime(df['ts'])

In [107]: df
Out[107]: 
    uuid  site                  ts  visit
0  +CW99  1124 2013-06-24 00:00:00      2
1  +CW99  1124 2013-06-26 00:00:00      1
2  +CW99  1124 2013-06-27 00:00:00      1
3  +CW99  1124 2013-06-20 00:00:00      1
4  +CW99  1124 2013-06-21 00:00:00      1
5  +CW99  1124 2013-06-24 00:00:00      2
6  +CW9W   956 2013-06-21 00:00:00      4
7  +CW9W   956 2013-06-22 00:00:00      2
8  +CW9W   956 2013-06-23 00:00:00      3
9  +CW9W   956 2013-06-24 00:00:00      4

In [106]: df.dtypes
Out[106]: 
uuid             object
site              int64
ts       datetime64[ns]
visit             int64
dtype: object

Create the master series of daily timestamps between the min and the max:

In [110]: all_ts = pd.date_range(df['ts'].min(),df['ts'].max())

In [111]: all_ts
Out[111]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-06-20 00:00:00, ..., 2013-06-27 00:00:00]
Length: 8, Freq: D, Timezone: None
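
Daily is date_range's default frequency; if you want to be explicit (a small aside, not part of the original answer):

# 'D' is the default frequency for date_range, but you can spell it out
all_ts = pd.date_range(df['ts'].min(), df['ts'].max(), freq='D')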

Define a function like this:

In [103]: def f(x):
   .....:     # we want all of the ``ts`` values that are not in the master time series
   .....:     adf = DataFrame(dict(ts = all_ts - Index(x['ts'])), columns=df.columns)
   .....:     # the added rows should have a visit of 0
   .....:     adf['visit'] = 0
   .....:     # first append them to the frame (x), ignoring the index,
   .....:     # sort by the ts column,
   .....:     # then forward-fill the missing values
   .....:     return x.append(adf, ignore_index=True).sort_index(by='ts').ffill()
   .....:
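
As an aside, this answer predates some API changes: DataFrame.append was later removed and sort_index(by=...) became sort_values. Here is a minimal sketch of the same function against the modern API, assuming the df and all_ts defined above (f_modern is just an illustrative name):

import pandas as pd

def f_modern(x):
    # dates in the master range that this group is missing
    missing_ts = all_ts.difference(pd.Index(x['ts']))
    # filler rows with the same columns as df, and zero visits
    adf = pd.DataFrame({'ts': missing_ts}, columns=df.columns)
    adf['visit'] = 0
    # pd.concat replaces the removed append; sort_values replaces sort_index(by=...)
    return pd.concat([x, adf], ignore_index=True).sort_values('ts').ffill()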

Apply the function (you could group by ['uuid','site'] if you want as well):

In [116]: df.groupby('uuid').apply(f)
Out[116]: 
          uuid  site                  ts  visit
uuid                                           
+CW99 3  +CW99  1124 2013-06-20 00:00:00      1
      4  +CW99  1124 2013-06-21 00:00:00      1
      0  +CW99  1124 2013-06-24 00:00:00      2
      5  +CW99  1124 2013-06-24 00:00:00      2
      6  +CW99  1124 2013-06-25 00:00:00      0
      1  +CW99  1124 2013-06-26 00:00:00      1
      2  +CW99  1124 2013-06-27 00:00:00      1
+CW9W 0  +CW9W   956 2013-06-21 00:00:00      4
      1  +CW9W   956 2013-06-22 00:00:00      2
      2  +CW9W   956 2013-06-23 00:00:00      3
      3  +CW9W   956 2013-06-24 00:00:00      4
      4  +CW9W   956 2013-06-25 00:00:00      0
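
If you don't want the group labels in the resulting index, a small usage note (my addition, using the same f as above): groupby accepts group_keys=False, or you can drop the index afterwards.

# flat integer index instead of the (uuid, original index) MultiIndex
result = df.groupby('uuid', group_keys=False).apply(f).reset_index(drop=True)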

Note: you have a duplicate in the posted frame. Not sure if that was intentional or not, but I preserved it. This is a somewhat easier problem if you don't have duplicates (in the ts column).

Here is the no-duplicates way:

In [207]: def f(x):
   .....:     # reindex this group onto the master daily range; missing dates become NaN rows
   .....:     x = x.set_index('ts').reindex(all_ts).reset_index()
   .....:     # zero-fill the metric, then forward-fill the identifying columns
   .....:     x['visit'] = x['visit'].fillna(0)
   .....:     return x.ffill()
   .....:

In [208]: df_no_dups.groupby('uuid').apply(f)
Out[208]: 
                      index   uuid  site  visit
uuid                                           
+CW99 0 2013-06-20 00:00:00  +CW99  1124      1
      1 2013-06-21 00:00:00  +CW99  1124      1
      2 2013-06-22 00:00:00  +CW99  1124      0
      3 2013-06-23 00:00:00  +CW99  1124      0
      4 2013-06-24 00:00:00  +CW99  1124      2
      5 2013-06-25 00:00:00  +CW99  1124      0
      6 2013-06-26 00:00:00  +CW99  1124      1
      7 2013-06-27 00:00:00  +CW99  1124      1
+CW9W 0 2013-06-20 00:00:00    NaN   NaN      0
      1 2013-06-21 00:00:00  +CW9W   956      4
      2 2013-06-22 00:00:00  +CW9W   956      2
      3 2013-06-23 00:00:00  +CW9W   956      3
      4 2013-06-24 00:00:00  +CW9W   956      4
      5 2013-06-25 00:00:00  +CW9W   956      0
      6 2013-06-26 00:00:00  +CW9W   956      0
      7 2013-06-27 00:00:00  +CW9W   956      0

This forces all the elements to be there (note the NaNs in the first +CW9W row: there is nothing earlier in the group to ffill from). You could drop these rows if you want.
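
Alternatively, a minimal sketch that fills those leading NaNs instead of dropping them; the bfill step is my addition rather than part of the answer, and it assumes each group carries a single uuid/site value:

def f_filled(x):
    x = x.set_index('ts').reindex(all_ts).reset_index()
    x['visit'] = x['visit'].fillna(0)
    # ffill fills forward from the first real row; bfill then covers rows before it
    x[['uuid', 'site']] = x[['uuid', 'site']].ffill().bfill()
    return x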


5 Comments

Thanks Jeff, the duplicate is unintentional. How would you solve this if no duplicate ts is allowed?
Added the non-dup one..... it gets a solution that includes ALL dates (but the first rows of a group have NaN values if its first date is not == the min value). FYI, if this detects dups it will raise an Exception.
It's almost what I needed, except those NaNs. Maybe I should group by uuid and site, then reset the index to get rid of the NaNs.
fillna, in the function f. You could do x['uuid'] = x['uuid'].fillna(x['uuid'].unique()[0]), for example.
You can also do the groupby by ['uuid','site'] as suggested. Lots of things to play with here.
