I had no success searching the forum for answers to this question, since it is hard to put into keywords. Any keyword suggestions are appreciated, so that I can make this question more accessible and others can benefit from it.

The closest question I found doesn't really answer mine.

My problem is the following:

I have a DataFrame called ref and a list of dates called pub. ref is indexed by date, but its dates mostly differ from the dates in pub (only a few values match). I want to create a new DataFrame that contains all the dates from pub, each filled with the "last available data" from ref, i.e. the row of ref with the most recent date on or before that pub date.

Thus, say ref is:

Dat          col1 col2 
2015-01-01   5    4
2015-01-02   6    7
2015-01-05   8    9

And pub is:

2015-01-01
2015-01-04
2015-01-06

I'd like to create a DataFrame like:

Dat          col1 col2 
2015-01-01   5    4
2015-01-04   6    7
2015-01-06   8    9

Performance matters here, so I'm looking for the fastest (or at least a fast) way of doing this.

Thanks in advance.

  • Do you need sequence (position) based replacement of the values in the Dat column with the pub list? Commented Apr 18, 2016 at 20:19

2 Answers

You can do an outer merge, set the new index to Dat, sort it, forward fill, and then reindex based on the dates in pub.

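import datetime as dt
import pandas as pd

# pub as a one-column DataFrame so it can be merged with ref on 'Dat'
dates = ['2015-01-01', '2015-01-04', '2015-01-06']
pub = pd.DataFrame([dt.datetime.strptime(ts, '%Y-%m-%d').date() for ts in dates],
                   columns=['Dat'])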

>>> (ref
     .merge(pub, on='Dat', how='outer')
     .set_index('Dat')
     .sort_index()
     .ffill()
     .reindex(pub.Dat))
            col1  col2
Dat                   
2015-01-01     5     4
2015-01-04     6     7
2015-01-06     8     9
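
For anyone who wants to run the steps above end to end, here is a minimal self-contained sketch. It assumes that Dat is an ordinary column of ref and that both ref and pub hold pandas datetimes (the question does not say which date type is used), so treat the setup as illustrative rather than as the asker's actual data.

```python
import pandas as pd

# Reference data from the question, with Dat as an ordinary column
# (assumption: datetime64 values, not strings or datetime.date objects).
ref = pd.DataFrame({
    "Dat": pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-05"]),
    "col1": [5, 6, 8],
    "col2": [4, 7, 9],
})

# Publication dates, built with the same dtype as ref["Dat"] so the merge keys match.
pub = pd.DataFrame(
    {"Dat": pd.to_datetime(["2015-01-01", "2015-01-04", "2015-01-06"])}
)

result = (
    ref.merge(pub, on="Dat", how="outer")  # union of both date sets, NaN for pub-only dates
    .set_index("Dat")
    .sort_index()                          # forward fill only makes sense in date order
    .ffill()                               # carry the last known row forward
    .reindex(pub["Dat"])                   # keep only the pub dates
)
print(result)
```

Note that col1 and col2 come back as floats here, because the outer merge introduces NaN for the dates that exist only in pub before they are forward filled.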

2 Comments

Hi Alexander, thanks for the help. But ideally I would like the second row to display 6 and 7, not 8 and 9. Is that possible?
Should be. What types are your dates? Timestamps, python datetime objects, or strings?

Use np.searchsorted to find the index just after each date (the 'right' option is needed to handle equality properly):

In [27]: pub = ['2015-01-01', '2015-01-04', '2015-01-06']

In [28]: df
Out[28]: 
            col1  col2
Dat                   
2015-01-01     5     4
2015-01-02     6     7
2015-01-05     8     9

In [29]: y=np.searchsorted(list(df.index),pub,'right')
#array([1, 2, 3], dtype=int64)

Then just rebuild:

In [30]: pd.DataFrame(df.iloc[y-1].values,index=pub)
Out[30]: 
            0  1
2015-01-01  5  4
2015-01-04  6  7
2015-01-06  8  9
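
As a self-contained variant of the same idea, the sketch below keeps the column names and works on a DatetimeIndex. It assumes that ref is already sorted by date and that no pub date is earlier than the first ref date; both are assumptions on my side, not something stated in the question.

```python
import numpy as np
import pandas as pd

# Data from the question; ref must be sorted by date for searchsorted to be valid.
ref = pd.DataFrame(
    {"col1": [5, 6, 8], "col2": [4, 7, 9]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-05"]),
)
ref.index.name = "Dat"
pub = pd.to_datetime(["2015-01-01", "2015-01-04", "2015-01-06"])

# side="right" returns the position just past any equal date, so pos - 1
# is the last ref row dated on or before each pub date.
pos = np.searchsorted(ref.index.values, pub.values, side="right")

# Assumption: every pub date is on or after the first ref date. Otherwise
# pos - 1 would be -1 and silently select the last row of ref.
assert (pos > 0).all()

# Rebuild with the pub dates as index, keeping the original column names.
result = pd.DataFrame(ref.values[pos - 1], index=pub, columns=ref.columns)
print(result)
```

If pub can contain dates before the first row of ref, those positions would need separate handling, for example by masking them to NaN.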

1 Comment

I guess this solution should be faster. Can you add a timeit comparison to your answer?
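
A timing comparison along these lines could be set up roughly as follows. This is only a sketch: the data sizes, the random values, and the helper names via_merge and via_searchsorted are made up for illustration, and the numbers will depend on your machine and pandas version, so no results are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: sizes and frequencies are arbitrary choices.
rng = np.random.default_rng(0)
ref_idx = pd.date_range("2000-01-01", periods=10_000, freq="D")
ref = pd.DataFrame(
    rng.integers(0, 100, size=(len(ref_idx), 2)),
    index=ref_idx,
    columns=["col1", "col2"],
)
ref.index.name = "Dat"
# Noon timestamps every second day, so none of them coincides with a ref date.
pub = pd.date_range("2000-01-01 12:00", periods=5_000, freq="2D")


def via_merge(ref, pub):
    """Outer merge, sort, forward fill, then keep only the pub dates."""
    pub_df = pd.DataFrame({"Dat": pub})
    return (
        ref.reset_index()
        .merge(pub_df, on="Dat", how="outer")
        .set_index("Dat")
        .sort_index()
        .ffill()
        .reindex(pub)
    )


def via_searchsorted(ref, pub):
    """Locate the last ref row on or before each pub date with searchsorted."""
    pos = np.searchsorted(ref.index.values, pub.values, side="right")
    return pd.DataFrame(ref.values[pos - 1], index=pub, columns=ref.columns)


# Sanity check: both approaches should produce the same values.
assert np.array_equal(
    via_merge(ref, pub).to_numpy(), via_searchsorted(ref, pub).to_numpy()
)

runs = 20
for fn in (via_merge, via_searchsorted):
    seconds = timeit.timeit(lambda: fn(ref, pub), number=runs)
    print(f"{fn.__name__}: {seconds / runs:.6f} s per call")
```

The equality check before the timing loop is there so that the two timings compare approaches that actually return the same result.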
