I want to add new files to historical table (both are in csv format and they are not in db), before that, I need to check new file with historical table by comparing its two column in particular, one is state and another one is date column. First, I need to check max (state, yyyy_mm), then check those entries with max(state, yyyy_mm) in historical table; if they are not historical table, then append them, otherwise do nothing.
So far I am able to pick the rows with max (state, yyyy_mm), but when I tried to compare those picked rows with historical table, I am not getting expected output. I tried pandas.merge, pandas.concat but output is not same with my expected output. Can anyone point me out how to do this in pandas? Any thoughts?
Input data:
>>> src_df.to_dict()
{'yyyy_mm': {0: 202001,
1: 202002,
2: 202003,
3: 202002,
4: 202107,
5: 202108,
6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 3: 'NY', 4: 'PA', 5: 'PA', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7}}
>>> hist_df.to_dict()
{'yyyy_mm': {0: 202101,
1: 202002,
2: 202001,
3: 201901,
4: 201907,
5: 201908,
6: 201901,
7: 201907,
8: 201908},
'state': {0: 'CA',
1: 'NJ',
2: 'NY',
3: 'NY',
4: 'NY',
5: 'NY',
6: 'PA',
7: 'PA',
8: 'PA'},
'col1': {0: 1, 1: 3, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
'col2': {0: 1, 1: 3, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
'col3': {0: 1, 1: 7, 2: 8, 3: 8, 4: 8, 5: 8, 6: 8, 7: 8, 8: 8}}
My current attempt:
picked_rows = src_df.loc[src_df.groupby('state')['yyyy_mm'].idxmax()]
>>> picked_rows.to_dict()
{'yyyy_mm': {0: 202001, 1: 202002, 2: 202003, 6: 202109},
'state': {0: 'CA', 1: 'NJ', 2: 'NY', 6: 'PA'},
'col1': {0: 3, 1: 3, 2: 3, 6: 3},
'col2': {0: 3, 1: 3, 2: 3, 6: 4},
'col3': {0: 7, 1: 7, 2: 7, 6: 7}}
Then I tried to do following but output is not same as my expected output:
output_df = pd.concat(picked_rows, hist_df, keys=['state', 'yyyy_mm'], axis=1) # first attempt
output_df = pd.merge(picked_rows, hist_df, how='outer') # second attempt
but both of those attempt not giving me my expected output. How should I get my desired output by comparing two dataframes where picked_rows should be append to hist_df by conditionally such as max('state', 'yyyy_mm'). How should we do this in pandas?
objective
I want to check picked_rows in hist_df where I need to check by state and yyyy_mm columns, so only add entries from picked_rows where state has max value or recent dates. I created desired output below. I tried inner join or pandas.concat but it is not giving me correct out. Does anyone have any ideas on this?
Here is my desired output that I want to get:
yyyy_mm state col1 col2 col3
0 202101 CA 1 1 1
1 202002 NJ 3 3 7
2 202001 NY 4 5 8
3 201901 NY 4 5 8
4 201907 NY 4 5 8
5 201908 NY 4 5 8
6 201901 PA 4 5 8
7 201907 PA 4 5 8
8 201908 PA 4 5 8
9 202003 NY 3 3 7
10 202109 PA 3 4 7
src_df.head(10).to_dict()andhist_df.head(10).to_dict()src_df.to_dict(), hist_df.to_dict(), desired_output.to_dict(). Do you have any ideas how to get my expected output correctly in pandas? thankssrc_dfandhist_dfhas more than 100+ colunms. Can we do this only mappingstate,yyyy_mmcolumns while keeping same logic here? any thoughts?stateandyyyy_mm. Please try the answer before asking follow-up questions.hist_df, so I tried this instead:new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_df.groupby("state")["yyyy_mm"]))], just remove themax(), but by doing this I got NAN values in columns. Any idea?