
I have a DataFrame which looks like this:

import pandas as pd

x = pd.DataFrame({
    'user': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    'rd': ['2016-01-01', '2016-01-01', '2016-02-01', '2016-02-01', '2016-02-01',
           '2016-05-01', '2016-05-01', '2016-06-01', '2016-06-01', '2016-06-01'],
    'fd': ['2016-02-01', '2016-04-01', '2016-03-01', '2016-04-01', '2016-05-01',
           '2016-06-01', '2016-07-01', '2016-08-01', '2016-07-01', '2016-09-01'],
    'val': [3, 4, 16, 7, 9, 2, 5, 11, 20, 1]})

x.head(6)

       fd          rd     user val
0   2016-02-01  2016-01-01  a   3
1   2016-04-01  2016-01-01  a   4
2   2016-03-01  2016-02-01  a   16
3   2016-04-01  2016-02-01  a   7
4   2016-05-01  2016-02-01  a   9
5   2016-06-01  2016-05-01  b   2

x['rd'] = pd.to_datetime(x['rd'])
x['fd'] = pd.to_datetime(x['fd'])

For each rd date I would like to have the dates of the following 3 months as fd. For instance:

rd = 2016-01-01 

I would like to have:

fd = [2016-02-01, 2016-03-01, 2016-04-01]

Basically, for each rd date I want the next 3 months as fd dates. My dataset has missing dates both in rd (2016-03-01 and 2016-04-01 never appear) and in fd for an rd that is present (for rd = 2016-01-01, fd = 2016-03-01 is missing).

Furthermore, I have 2 different users, x['user'].unique() = ['a', 'b'], so dates (both 'rd' and 'fd') may be missing for one user, for the other, or for both.

What I would like to achieve is an efficient way to get a dataframe with all dates for all users.

This question builds on an already answered one, but the problem here is a little more complex, and I have not been able to apply a MultiIndex to it.

What I have done so far is create the two columns of dates:

from dateutil.relativedelta import relativedelta

# All month starts between the first and last observed rd
index = pd.date_range(x['rd'].min(), x['rd'].max(), freq='MS')

def add_months(date):
    # The three month starts that follow `date`
    return [date + relativedelta(months=k) for k in (1, 2, 3)]

# Flat list of fd dates: three per rd, in rd order
fcs_dates = [fd for rd in index for fd in add_months(rd)]
# Each rd repeated three times so it lines up with fcs_dates
index3 = sorted(index.tolist() * 3)

So the output is:

list(zip(index3, fcs_dates))[:5]

[(Timestamp('2016-01-01 00:00:00', freq='MS'),
  Timestamp('2016-02-01 00:00:00', freq='MS')),
 (Timestamp('2016-01-01 00:00:00', freq='MS'),
  Timestamp('2016-03-01 00:00:00', freq='MS')),
 (Timestamp('2016-01-01 00:00:00', freq='MS'),
  Timestamp('2016-04-01 00:00:00', freq='MS')),
 (Timestamp('2016-02-01 00:00:00', freq='MS'),
  Timestamp('2016-03-01 00:00:00', freq='MS')),
 (Timestamp('2016-02-01 00:00:00', freq='MS'),
  Timestamp('2016-04-01 00:00:00', freq='MS'))]
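For reference, the same (rd, fd) pairs can also be built directly as a two-column DataFrame. This is only a sketch: it hard-codes the rd range of the sample data, uses pd.DateOffset in place of relativedelta, and the name `pairs` is made up for illustration.

```python
import pandas as pd

# Month starts covering the rd range of the sample data (assumed here)
index = pd.date_range("2016-01-01", "2016-06-01", freq="MS")

# One row per (rd, fd) pair, with fd = rd + 1, 2 and 3 months
pairs = pd.DataFrame(
    [(rd, rd + pd.DateOffset(months=k)) for rd in index for k in (1, 2, 3)],
    columns=["rd", "fd"],
)
print(pairs.head())
```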

Unfortunately I have no clue how to plug this into a MultiIndex.

Thank you for your help

2 Answers

I'm having a lot of trouble understanding your question, and I can't get index3 to work in Python 3.

Are you looking for something along these lines?

indx = pd.MultiIndex.from_product([['a', 'b'], index, pd.DatetimeIndex(fcs_dates)])

If you're able to construct the levels you want in your multi-index, from_product takes their cartesian product to create the index.
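To illustrate what the Cartesian product contains, here is a small sketch. It assumes the rd and fd ranges of the sample data, and the month-gap filter at the end is an addition for illustration, not part of the suggestion above; the names `full`, `gap` and `wanted` are hypothetical.

```python
import pandas as pd

# Assumed level values, taken from the sample data in the question
index = pd.date_range("2016-01-01", "2016-06-01", freq="MS")      # rd values
fd_values = pd.date_range("2016-02-01", "2016-09-01", freq="MS")  # fd values

# Every (user, rd, fd) combination: 2 * 6 * 8 entries
full = pd.MultiIndex.from_product(
    [["a", "b"], index, fd_values], names=["user", "rd", "fd"]
)

# Keep only the entries where fd is 1 to 3 months after rd
rd = full.get_level_values("rd")
fd = full.get_level_values("fd")
gap = (fd.year - rd.year) * 12 + (fd.month - rd.month)
wanted = full[(gap >= 1) & (gap <= 3)]

print(len(full), len(wanted))
```

The product includes every rd paired with every fd, so it has to be filtered down before it matches the "next 3 months" requirement.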


1 Comment

Thank you. I edited the question to add the conversion of the dates to datetime, so it should work now. Unfortunately it's not what I'm looking for: taking the product of index and fcs_dates will also give me rows like rd = 2017-01-01, fd = 2017-07-01, which I don't want...

So, I solved my own question by doing a left join for each group (user), where the left dataframe is the one constructed with dates.

pd.DataFrame with dates:

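left_df = pd.DataFrame({'rd': index3, 'fd': fcs_dates})
# The string conversion assumes x['rd'] and x['fd'] are still strings;
# skip these two lines if they were converted with pd.to_datetime
left_df['rd'] = left_df['rd'].astype(str)
left_df['fd'] = left_df['fd'].astype(str)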

grouped by user DataFrame:

list_gr = []
for user, gr in x.groupby('user'):
    # Left join keeps every (rd, fd) pair, with NaN where this user has no row
    gr_new = pd.merge(left_df, gr, on=['rd', 'fd'], how='left')
    list_gr.append(gr_new)

df_final = pd.concat(list_gr)

final dataframe:

    fd          rd          user val
0   2016-02-01  2016-01-01  a   3.0
1   2016-03-01  2016-01-01  NaN NaN
2   2016-04-01  2016-01-01  a   4.0
3   2016-03-01  2016-02-01  a   16.0
4   2016-04-01  2016-02-01  a   7.0
5   2016-05-01  2016-02-01  a   9.0
6   2016-04-01  2016-03-01  NaN NaN
7   2016-05-01  2016-03-01  NaN NaN
8   2016-06-01  2016-03-01  NaN NaN
9   2016-05-01  2016-04-01  NaN NaN
10  2016-06-01  2016-04-01  NaN NaN
11  2016-07-01  2016-04-01  NaN NaN
12  2016-06-01  2016-05-01  NaN NaN
13  2016-07-01  2016-05-01  NaN NaN
14  2016-08-01  2016-05-01  NaN NaN
15  2016-07-01  2016-06-01  NaN NaN
16  2016-08-01  2016-06-01  NaN NaN
17  2016-09-01  2016-06-01  NaN NaN
0   2016-02-01  2016-01-01  NaN NaN
1   2016-03-01  2016-01-01  NaN NaN
2   2016-04-01  2016-01-01  NaN NaN
3   2016-03-01  2016-02-01  NaN NaN
4   2016-04-01  2016-02-01  NaN NaN
5   2016-05-01  2016-02-01  NaN NaN
6   2016-04-01  2016-03-01  NaN NaN
7   2016-05-01  2016-03-01  NaN NaN
8   2016-06-01  2016-03-01  NaN NaN
9   2016-05-01  2016-04-01  NaN NaN
10  2016-06-01  2016-04-01  NaN NaN
11  2016-07-01  2016-04-01  NaN NaN
12  2016-06-01  2016-05-01  b   2.0
13  2016-07-01  2016-05-01  b   5.0
14  2016-08-01  2016-05-01  NaN NaN
15  2016-07-01  2016-06-01  b   20.0
16  2016-08-01  2016-06-01  b   11.0
17  2016-09-01  2016-06-01  b   1.0

Unfortunately I don't think this is the quickest method, but I got what I wanted.
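A possible alternative that avoids the per-user loop is to build the full (user, rd, fd) index once and reindex against it. This is a sketch under two assumptions: each (user, rd, fd) combination appears at most once in x (true for the sample data; reindex fails on a duplicated index), and pd.DateOffset is used in place of relativedelta. The names `tuples` and `full` are made up for illustration. I have not timed it against the join.

```python
import pandas as pd

x = pd.DataFrame({
    "user": ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"],
    "rd": ["2016-01-01", "2016-01-01", "2016-02-01", "2016-02-01", "2016-02-01",
           "2016-05-01", "2016-05-01", "2016-06-01", "2016-06-01", "2016-06-01"],
    "fd": ["2016-02-01", "2016-04-01", "2016-03-01", "2016-04-01", "2016-05-01",
           "2016-06-01", "2016-07-01", "2016-08-01", "2016-07-01", "2016-09-01"],
    "val": [3, 4, 16, 7, 9, 2, 5, 11, 20, 1],
})
x["rd"] = pd.to_datetime(x["rd"])
x["fd"] = pd.to_datetime(x["fd"])

# Month starts between the first and last observed rd
index = pd.date_range(x["rd"].min(), x["rd"].max(), freq="MS")

# Every user paired with every rd and the three months that follow it
tuples = [
    (user, rd, rd + pd.DateOffset(months=k))
    for user in x["user"].unique()
    for rd in index
    for k in (1, 2, 3)
]
full = pd.MultiIndex.from_tuples(tuples, names=["user", "rd", "fd"])

# Rows missing from x come back with NaN in val
df_final = x.set_index(["user", "rd", "fd"]).reindex(full).reset_index()
print(df_final.head())
```

Unlike the left join, the user column is filled in on every row here, because the user is part of the index being reindexed.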
