252

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.

idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()

In the above code idx becomes a range of say 30 dates. 09-01-2013 to 09-30-2013 However S may only have 25 or 26 days because no events happened for a given date. I then get an AssertionError as the sizes dont match when I try to plot:

fig, ax = plt.subplots()    
ax.bar(idx.to_pydatetime(), s, color='green')

What's the proper way to tackle this? Do I want to remove dates with no values from IDX or (which I'd rather do) is add to the series the missing date with a count of 0. I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?

Here's a snippet of S ( df.groupby(['simpleDate']).size() ), notice no entries for 04 and 05.

09-02-2013     2
09-03-2013    10
09-06-2013     5
09-07-2013     1
0

7 Answers 7

429

You could use Series.reindex:

import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
print(s)

yields

2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
...
Sign up to request clarification or add additional context in comments.

9 Comments

reindex is an amazing function. It can (1) reorder existing data to match a new set of labels, (2) insert new rows where no label previously existed, (3) fill data for missing labels, (including by forward/backward filling) (4) select rows by label!
There is one problem (or bug) with reindex though: it doesn't work with dates before 1/1/1970, so in this case df.resample() works perfectly.
you may use this instead for idx to skip entering start and end dates manually: idx = pd.date_range(df.index.min(), df.index.max())
Dropping the link to the documentation here, to save you the search: pandas.pydata.org/pandas-docs/stable/reference/api/…
reindex does not work at least anymore
|
102

A quicker workaround is to use .asfreq(). This doesn't require creation of a new index to call within .reindex().

# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'), 
                  pd.Timestamp('2012-05-04'), 
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)

print(s.asfreq('D'))
2012-05-01    1.0
2012-05-02    NaN
2012-05-03    NaN
2012-05-04    2.0
2012-05-05    NaN
2012-05-06    3.0
Freq: D, dtype: float64

5 Comments

I really prefer this method; you avoid having to call date_range since it implicitly uses the first and last index as the start and end (which is what you would almost always want).
Very clean and professional method. Works well with using interpolate afterwards as well.
I second this. This is also a great method to use before merging two dataframes of different index length where joins, merges etc. almost always leads to errors such as a column full of NaNs.
Thanks for your answer but I still have a question. Given that I want to start on date x-x-x and end on date y-y-y and on my dataset 's' I have dates e-e-e to f-f-f, that are between dates x-x-x and y-y-y. Using "asfreq" how can I fill the dates on my dataset 's' from x-x-x to y-y-y? I have not found on the docs. Thank you
Yes I used this method to insert NaN in missing dates in a dataframe before plotting it with matplatlib
38

An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:

df.resample('D').mean()

resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.

Here is the original data, but with an extra entry for '2013-09-03':

             val
date           
2013-09-02     2
2013-09-03    10
2013-09-03    20    <- duplicate date added to OP's data
2013-09-06     5
2013-09-07     1

And here are the results:

             val
date            
2013-09-02   2.0
2013-09-03  15.0    <- mean of original values for 2013-09-03
2013-09-04   NaN    <- NaN b/c date not present in orig
2013-09-05   NaN    <- NaN b/c date not present in orig
2013-09-06   5.0
2013-09-07   1.0

I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.

Comments

37

One issue is that reindex will fail if there are duplicate values. Say we're working with timestamped data, which we want to index by date:

df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
    'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-18  "2016-11-18 04:00:00"  d

Due to the duplicate 2016-11-16 date, an attempt to reindex:

all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)

fails with:

...
ValueError: cannot reindex from a duplicate axis

(by this it means the index has duplicates, not that it is itself a dup)

Instead, we can use .loc to look up entries for all dates in range:

df.loc[all_days]

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-17  NaN                    NaN
2016-11-18  "2016-11-18 04:00:00"  d

fillna can be used on the column series to fill blanks if needed.

2 Comments

Any idea on what to do if Date column contains Blanks or NULLS? df.loc[all_days] won't work in that case.
Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. See the documentation here: pandas.pydata.org/pandas-docs/stable/…
10

Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:

def fill_in_missing_dates(df, date_col_name = 'date',date_order = 'asc', fill_value = 0, days_back = 30):

    df.set_index(date_col_name,drop=True,inplace=True)
    df.index = pd.DatetimeIndex(df.index)
    d = datetime.now().date()
    d2 = d - timedelta(days = days_back)
    idx = pd.date_range(d2, d, freq = "D")
    df = df.reindex(idx,fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)

    return df

Comments

2
s.asfreq('D').interpolate().asfreq('Q')

Comments

2

You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.

# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
    'date':pd.to_datetime([
        '2022-02-10'
        ,'2022-02-11'
        ,'2022-02-14'
        ,'2022-02-14'
        ,'2022-02-24'
        ,'2022-02-16'
    ])
    ,'value':[10,20,5,10,15,30]
})

# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])

# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.