Add missing dates to pandas dataframe

Question

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.

idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()

In the above code, idx becomes a range of, say, 30 dates (09-01-2013 to 09-30-2013). However, s may have only 25 or 26 days because no events happened on some dates. I then get an AssertionError because the sizes don't match when I try to plot:

fig, ax = plt.subplots()    
ax.bar(idx.to_pydatetime(), s, color='green')

What's the proper way to tackle this? Should I remove the dates with no values from idx, or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?

Here's a snippet of s (df.groupby(['simpleDate']).size()); notice there are no entries for 04 and 05.

09-02-2013     2
09-03-2013    10
09-06-2013     5
09-07-2013     1

unutbu · Accepted Answer · 2013-10-11 18:36:56Z

429

You could use Series.reindex:

import pandas as pd

idx = pd.date_range('09-01-2013', '09-30-2013')

s = pd.Series({'09-02-2013': 2,
               '09-03-2013': 10,
               '09-06-2013': 5,
               '09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)

s = s.reindex(idx, fill_value=0)
print(s)

yields

2013-09-01     0
2013-09-02     2
2013-09-03    10
2013-09-04     0
2013-09-05     0
2013-09-06     5
2013-09-07     1
2013-09-08     0
...
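
To connect this back to the question, here is a minimal sketch that starts from a groupby count instead of a hand-built Series. The event data is made up for illustration; only the column name simpleDate and the September 2013 range come from the question.

```python
import pandas as pd

# Hypothetical event log: several events on some days, none on others.
df = pd.DataFrame({
    "simpleDate": pd.to_datetime(
        ["2013-09-02", "2013-09-02", "2013-09-03", "2013-09-06", "2013-09-07"]
    )
})

# Count events per day; days without events are absent from the result.
counts = df.groupby("simpleDate").size()

# Full daily range for the month (endpoints taken from the question).
idx = pd.date_range("2013-09-01", "2013-09-30")

# Add the missing days with a count of 0 so both have the same length.
counts = counts.reindex(idx, fill_value=0)

print(len(idx), len(counts))
print(counts.head(7))
```

Because counts now shares its index with idx, a call such as ax.bar(counts.index, counts.values) should receive two sequences of equal length.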

edited Oct 11, 2013 at 18:36

answered Oct 11, 2013 at 18:08

unutbu


9 Comments

unutbu Over a year ago

reindex is an amazing function. It can (1) reorder existing data to match a new set of labels, (2) insert new rows where no label previously existed, (3) fill data for missing labels (including by forward/backward filling), and (4) select rows by label!
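
The four uses listed in this comment can be sketched on a small integer-indexed Series. The data and variable names are invented for illustration.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[0, 2, 4])

# (1) Reorder existing data to match a new label order.
reordered = s.reindex([4, 2, 0])

# (2) Insert rows for labels that did not exist (filled with NaN).
expanded = s.reindex(range(5))

# (3) Fill the new labels, here by forward filling from the previous label.
filled = s.reindex(range(5), method="ffill")

# (4) Select rows by label.
selected = s.reindex([2])

print(reordered, expanded, filled, selected, sep="\n\n")
```

Note that method="ffill" requires the original index to be sorted (monotonic).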

Sergey Gulbin Over a year ago

There is one problem (or bug) with reindex though: it doesn't work with dates before 1/1/1970, so in this case df.resample() works perfectly.

Reveille Over a year ago

you may use this instead for idx to skip entering start and end dates manually: idx = pd.date_range(df.index.min(), df.index.max())

Harm Over a year ago

Dropping the link to the documentation here, to save you the search: pandas.pydata.org/pandas-docs/stable/reference/api/…

intergallactic Over a year ago

reindex does not work at least anymore


Brad Solomon · Accepted Answer · 2018-06-28 19:05:46Z

102

A quicker workaround is to use .asfreq(). It doesn't require creating a new index to pass to .reindex().

import pandas as pd

# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
                  pd.Timestamp('2012-05-04'),
                  pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)

print(s.asfreq('D'))

yields

2012-05-01    1.0
2012-05-02    NaN
2012-05-03    NaN
2012-05-04    2.0
2012-05-05    NaN
2012-05-06    3.0
Freq: D, dtype: float64
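
Since the question asks for zeros rather than NaN, a small variation may be useful: asfreq accepts a fill_value argument that is applied to the rows it introduces. This is a sketch reusing the same three dates; the name filled is just for illustration.

```python
import pandas as pd

dates = pd.DatetimeIndex(["2012-05-01", "2012-05-04", "2012-05-06"])
s = pd.Series([1, 2, 3], index=dates)

# Rows created by upsampling get 0 instead of NaN.
filled = s.asfreq("D", fill_value=0)
print(filled)
```

fill_value only affects rows added by the upsampling; NaN values that were already present in the data are left untouched.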

edited Jun 28, 2018 at 19:05

answered Aug 2, 2017 at 19:18

Brad Solomon


5 Comments

Michael Hays Over a year ago

I really prefer this method; you avoid having to call date_range since it implicitly uses the first and last index as the start and end (which is what you would almost always want).

msarafzadeh Over a year ago

Very clean and professional method. Works well with using interpolate afterwards as well.

user3661992 Over a year ago

I second this. This is also a great method to use before merging two dataframes of different index length, where joins, merges etc. almost always lead to errors such as a column full of NaNs.

Catarina Nogueira Over a year ago

Thanks for your answer but I still have a question. Given that I want to start on date x-x-x and end on date y-y-y and on my dataset 's' I have dates e-e-e to f-f-f, that are between dates x-x-x and y-y-y. Using "asfreq" how can I fill the dates on my dataset 's' from x-x-x to y-y-y? I have not found on the docs. Thank you

PerseP Over a year ago

Yes, I used this method to insert NaN in missing dates in a dataframe before plotting it with matplotlib.
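
On the question in the comments about covering a range wider than the data itself: asfreq only spans the first to the last existing index value, so one option is to fall back to reindex with an explicit date_range. A minimal sketch, where the start and end dates are hypothetical stand-ins for x-x-x and y-y-y:

```python
import pandas as pd

s = pd.Series([1, 2], index=pd.to_datetime(["2012-05-03", "2012-05-05"]))

# Hypothetical outer bounds, wider than the data's own first and last dates.
start, end = "2012-05-01", "2012-05-07"

# Every day from start to end; days not present in s become NaN.
full = s.reindex(pd.date_range(start, end, freq="D"))
print(full)
```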

JohnE · Accepted Answer · 2019-01-04 16:35:17Z

An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:

df.resample('D').mean()

resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.

Here is the original data, but with an extra entry for '2013-09-03':

             val
date           
2013-09-02     2
2013-09-03    10
2013-09-03    20    <- duplicate date added to OP's data
2013-09-06     5
2013-09-07     1

And here are the results:

             val
date            
2013-09-02   2.0
2013-09-03  15.0    <- mean of original values for 2013-09-03
2013-09-04   NaN    <- NaN b/c date not present in orig
2013-09-05   NaN    <- NaN b/c date not present in orig
2013-09-06   5.0
2013-09-07   1.0

I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
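
As a sketch of that last step, the following rebuilds the example data shown above, fills the missing days with zeroes, and also shows size(), which may be closer to what the OP wants since it counts rows per day. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"val": [2, 10, 20, 5, 1]},
    index=pd.to_datetime(
        ["2013-09-02", "2013-09-03", "2013-09-03", "2013-09-06", "2013-09-07"]
    ),
)
df.index.name = "date"

# Mean per day, with missing days replaced by 0.
daily_mean = df.resample("D").mean().fillna(0)

# Number of rows per day; days without rows get a count of 0.
daily_count = df.resample("D").size()

print(daily_mean)
print(daily_count)
```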

Nick Edgar · Accepted Answer · 2016-11-16 23:36:31Z

37

One issue is that reindex will fail if the index contains duplicate values. Say we're working with timestamped data, which we want to index by date:

import pandas as pd

df = pd.DataFrame({
    'timestamps': pd.to_datetime(
        ['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
    'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-18  "2016-11-18 04:00:00"  d

Due to the duplicate 2016-11-16 date, an attempt to reindex:

all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)

fails with:

...
ValueError: cannot reindex from a duplicate axis

(by this it means the index has duplicates, not that it is itself a dup)

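Instead, we can use .loc to look up entries for all dates in range (this relies on older pandas behavior; since pandas 1.0, .loc raises a KeyError when any requested label is missing):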

df.loc[all_days]

yields

            timestamps             values
2016-11-15  "2016-11-15 01:00:00"  a
2016-11-16  "2016-11-16 02:00:00"  b
2016-11-16  "2016-11-16 03:00:00"  c
2016-11-17  NaN                    NaN
2016-11-18  "2016-11-18 04:00:00"  d

fillna can be used on the column series to fill blanks if needed.
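
On recent pandas versions, where .loc raises KeyError for missing labels, one possible substitute (not part of the original answer) is a left join from an empty frame indexed by every day. This is a sketch under that assumption; the name expanded is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "timestamps": pd.to_datetime(
        ["2016-11-15 1:00", "2016-11-16 2:00", "2016-11-16 3:00", "2016-11-18 4:00"]
    ),
    "values": ["a", "b", "c", "d"],
})
df.index = pd.DatetimeIndex(df["timestamps"]).floor("D")
df.index.name = None  # avoid sharing a name with the 'timestamps' column

all_days = pd.date_range(df.index.min(), df.index.max(), freq="D")

# Left-join the data onto an empty frame that has one row per day.
# Duplicate dates are kept; days without data get NaN/NaT.
expanded = pd.DataFrame(index=all_days).join(df)
print(expanded)
```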

answered Nov 16, 2016 at 23:36

Nick Edgar


2 Comments

Furqan Hashim Over a year ago

Any idea on what to do if Date column contains Blanks or NULLS? df.loc[all_days] won't work in that case.

D M Over a year ago

Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. See the documentation here: pandas.pydata.org/pandas-docs/stable/…

Midavalo · Accepted Answer · 2017-02-28 02:07:52Z

10

Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:

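from datetime import datetime, timedelta

import pandas as pd

def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
    # Index a copy by date so the caller's dataframe is not modified.
    df = df.set_index(date_col_name, drop=True)
    df.index = pd.DatetimeIndex(df.index)
    # Build the daily range covering the last `days_back` days up to today.
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq="D")
    # Insert the missing days; rows outside the range are dropped.
    df = df.reindex(idx, fill_value=fill_value)
    df[date_col_name] = pd.DatetimeIndex(df.index)
    # Sort by date in the requested order ('asc' or anything else for descending).
    return df.sort_index(ascending=(date_order == 'asc'))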

edited Feb 28, 2017 at 2:07

Midavalo


answered Feb 25, 2016 at 10:59

eiTan LaVi


thistleknot · Accepted Answer · 2022-06-11 22:20:23Z

2

s.asfreq('D').interpolate().asfreq('Q')
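
This one-liner upsamples to daily frequency, fills the gaps by linear interpolation, then keeps only the points at a coarser frequency. The sketch below walks through the same chain on made-up data, substituting 'W' (weekly) for 'Q' so the result stays small; note that newer pandas releases (2.2 onward) spell the quarter-end alias 'QE'.

```python
import pandas as pd

# Two known points ten days apart (illustrative data).
s = pd.Series(
    [0.0, 10.0],
    index=pd.to_datetime(["2022-01-01", "2022-01-11"]),
)

# Step 1: one row per day, with NaN for the new days.
# Step 2: linear interpolation fills those NaN values.
daily = s.asfreq("D").interpolate()

# Step 3: keep only the rows that fall on the weekly (Sunday) anchor.
weekly = daily.asfreq("W")

print(daily)
print(weekly)
```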

answered Jun 11, 2022 at 22:20

thistleknot


Sylvester is on codidact.com · Accepted Answer · 2022-12-19 17:16:01Z

2

You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.

import pandas as pd

# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
    'date':pd.to_datetime([
        '2022-02-10'
        ,'2022-02-11'
        ,'2022-02-14'
        ,'2022-02-14'
        ,'2022-02-24'
        ,'2022-02-16'
    ])
    ,'value':[10,20,5,10,15,30]
})

# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])

# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
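
After the left join, the dates that were missing carry NaN in the value column. A short sketch of filling them with zeroes follows, using a smaller, made-up version of the data; the fillna step is an addition and not part of the original answer.

```python
import pandas as pd

missing_df = pd.DataFrame({
    "date": pd.to_datetime(["2022-02-10", "2022-02-11", "2022-02-14", "2022-02-14"]),
    "value": [10, 20, 5, 10],
})

# One row per calendar day between the first and last date.
all_dates = pd.DataFrame(
    {"date": pd.date_range(missing_df["date"].min(), missing_df["date"].max())}
)

# Left join keeps every day (and duplicate dates); unmatched days get NaN.
new_df = all_dates.merge(missing_df, how="left", on="date")

# Replace NaN with 0 and restore the integer dtype lost to NaN.
new_df["value"] = new_df["value"].fillna(0).astype(int)
print(new_df)
```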

edited Dec 19, 2022 at 17:16

Sylvester is on codidact.com


answered Feb 16, 2022 at 5:29

Hakuna-Patata
