Fastest way to create DataFrame from last available data

Question

I had no success looking for answers to this question in the forum, since it is hard to put into keywords. Any keyword suggestions are appreciated so that I can make this question more accessible and others can benefit from it.

The closest question I found doesn't really answer mine.

My problem is the following:

I have one DataFrame called ref and a list of dates called pub. ref is indexed by dates, but those dates differ from the dates in pub (only a few values match). I want to create a new DataFrame that contains all the dates from pub, filled with the "last available data" from ref, that is, the row of ref with the latest date on or before each pub date.

Thus, say ref is:

Dat          col1 col2 
2015-01-01   5    4
2015-01-02   6    7
2015-01-05   8    9

And pub

2015-01-01
2015-01-04
2015-01-06

I'd like to create a DataFrame like:

Dat          col1 col2 
2015-01-01   5    4
2015-01-04   6    7
2015-01-06   8    9

Performance matters here, so I'm looking for the fastest (or at least a fast) way of doing this.

Thanks in advance.

Do you need a sequence (position) based replacement of the values in the Dat column with the pub list?
– Joshua Baboo, Commented Apr 18, 2016 at 20:19

Alexander · Accepted Answer · 2016-04-18 20:33:54Z

2

You can do an outer merge, set the new index to Dat, sort it, forward fill, and then reindex based on the dates in pub.

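import datetime as dt
import pandas as pd

# Dates to publish, as a one-column DataFrame so it can be merged with ref.
# Assumes the Dat values in ref are datetime.date objects too, so the merge keys match.
dates = ['2015-01-01', '2015-01-04', '2015-01-06']
pub = pd.DataFrame([dt.datetime.strptime(ts, '%Y-%m-%d').date() for ts in dates],
                   columns=['Dat'])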

>>> (ref
     .merge(pub, on='Dat', how='outer')
     .set_index('Dat')
     .sort_index()
     .ffill()
     .reindex(pub.Dat))
            col1  col2
Dat                   
2015-01-01     5     4
2015-01-04     6     7
2015-01-06     8     9
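
For readers who want to run the steps above end to end, here is a minimal self-contained sketch. It assumes ref holds Dat as a regular column and that both frames use pandas timestamps (built with pd.to_datetime) rather than datetime.date objects; the function name last_available is only illustrative. Because the outer merge introduces missing values before the forward fill, the numeric columns may come back as floats.

```python
import pandas as pd

# Sample data from the question; Dat is kept as a regular column here (assumption).
ref = pd.DataFrame({
    "Dat": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-05"]),
    "col1": [5, 6, 8],
    "col2": [4, 7, 9],
})
pub = pd.DataFrame(
    {"Dat": pd.to_datetime(["2015-01-01", "2015-01-04", "2015-01-06"])}
)


def last_available(ref, pub):
    """Outer merge, sort by date, forward fill, then keep only the pub dates."""
    return (ref
            .merge(pub, on="Dat", how="outer")  # union of both date sets
            .set_index("Dat")
            .sort_index()                       # forward fill needs date order
            .ffill()                            # carry the last known row forward
            .reindex(pub["Dat"]))               # keep only the requested dates


result = last_available(ref, pub)
print(result)
```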

edited Apr 18, 2016 at 20:33

answered Apr 18, 2016 at 20:14


2 Comments

Pedro Braz Over a year ago

Hi Alexander, thanks for the help. But ideally I would like the second row to display 6 and 7, not 8 and 9. Is that possible?

Alexander Over a year ago

Should be. What types are your dates? Timestamps, python datetime objects, or strings?

B. M. · Accepted Answer · 2016-04-18 21:20:39Z

2

Use np.searchsorted to find, for each date in pub, the insertion index just after it in the sorted index of df (the 'right' option is needed to handle equality properly):

In [27]: pub = ['2015-01-01', '2015-01-04', '2015-01-06']

In [28]: df
Out[28]: 
            col1  col2
Dat                   
2015-01-01     5     4
2015-01-02     6     7
2015-01-05     8     9

In [29]: y = np.searchsorted(list(df.index), pub, 'right')
# y is array([1, 2, 3], dtype=int64)

Then just rebuild the DataFrame, taking the row just before each insertion index:

In [30]: pd.DataFrame(df.iloc[y - 1].values, index=pub)
Out[30]: 
            0  1
2015-01-01  5  4
2015-01-04  6  7
2015-01-06  8  9
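
The same idea can be sketched as a self-contained script. This version assumes ref is indexed by sorted pandas timestamps and that pub is a DatetimeIndex; the function name last_available_searchsorted is hypothetical. It also keeps the original column names and adds a guard, since subtracting 1 would otherwise give -1 for any pub date earlier than the first date in ref, and that position silently selects the last row.

```python
import numpy as np
import pandas as pd

# Sample data from the question, indexed by date (assumed sorted ascending).
ref = pd.DataFrame(
    {"col1": [5, 6, 8], "col2": [4, 7, 9]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-05"]),
)
pub = pd.to_datetime(["2015-01-01", "2015-01-04", "2015-01-06"])


def last_available_searchsorted(ref, pub):
    """For each pub date, pick the ref row with the latest date <= that date."""
    # side="right" puts equal dates after their match, so minus 1 lands on it.
    pos = np.searchsorted(ref.index.values, pub.values, side="right") - 1
    if (pos < 0).any():
        # Without this check, -1 would wrap around to the last row of ref.
        raise ValueError("some pub dates precede the first date in ref")
    return pd.DataFrame(ref.to_numpy()[pos], index=pub, columns=ref.columns)


result = last_available_searchsorted(ref, pub)
print(result)
```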

edited Apr 18, 2016 at 21:20

answered Apr 18, 2016 at 20:36


1 Comment

MaxU - stand with Ukraine Over a year ago

I guess this solution should be faster. Can you add a timeit comparison to your answer?
