how to compare two dataframes by multiple columns and only append new entries in pandas?

Question

I want to add new files to historical table (both are in csv format and they are not in db), before that, I need to check new file with historical table by comparing its two column in particular, one is state and another one is date column. First, I need to check max (state, yyyy_mm), then check those entries with max(state, yyyy_mm) in historical table; if they are not historical table, then append them, otherwise do nothing.

So far I am able to pick the rows with max (state, yyyy_mm), but when I tried to compare those picked rows with historical table, I am not getting expected output. I tried pandas.merge, pandas.concat but output is not same with my expected output. Can anyone point me out how to do this in pandas? Any thoughts?

Input data:

>>> src_df.to_dict()   
{'yyyy_mm': {0: 202001,
  1: 202002,
  2: 202003,
  3: 202002,
  4: 202107,
  5: 202108,
  6: 202109},
 'state': {0: 'CA', 1: 'NJ', 2: 'NY', 3: 'NY', 4: 'PA', 5: 'PA', 6: 'PA'},
 'col1': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3},
 'col2': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4},
 'col3': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7}}

>>> hist_df.to_dict()
{'yyyy_mm': {0: 202101,
  1: 202002,
  2: 202001,
  3: 201901,
  4: 201907,
  5: 201908,
  6: 201901,
  7: 201907,
  8: 201908},
 'state': {0: 'CA',
  1: 'NJ',
  2: 'NY',
  3: 'NY',
  4: 'NY',
  5: 'NY',
  6: 'PA',
  7: 'PA',
  8: 'PA'},
 'col1': {0: 1, 1: 3, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
 'col2': {0: 1, 1: 3, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
 'col3': {0: 1, 1: 7, 2: 8, 3: 8, 4: 8, 5: 8, 6: 8, 7: 8, 8: 8}}

My current attempt:

picked_rows = src_df.loc[src_df.groupby('state')['yyyy_mm'].idxmax()]

>>> picked_rows.to_dict()
{'yyyy_mm': {0: 202001, 1: 202002, 2: 202003, 6: 202109},
 'state': {0: 'CA', 1: 'NJ', 2: 'NY', 6: 'PA'},
 'col1': {0: 3, 1: 3, 2: 3, 6: 3},
 'col2': {0: 3, 1: 3, 2: 3, 6: 4},
 'col3': {0: 7, 1: 7, 2: 7, 6: 7}}

Then I tried to do following but output is not same as my expected output:

output_df = pd.concat(picked_rows, hist_df, keys=['state', 'yyyy_mm'], axis=1)  # first attempt
output_df = pd.merge(picked_rows, hist_df, how='outer')   # second attempt

but both of those attempt not giving me my expected output. How should I get my desired output by comparing two dataframes where picked_rows should be append to hist_df by conditionally such as max('state', 'yyyy_mm'). How should we do this in pandas?

objective

I want to check picked_rows in hist_df where I need to check by state and yyyy_mm columns, so only add entries from picked_rows where state has max value or recent dates. I created desired output below. I tried inner join or pandas.concat but it is not giving me correct out. Does anyone have any ideas on this?

Here is my desired output that I want to get:

    yyyy_mm state  col1  col2  col3
0    202101    CA     1     1     1
1    202002    NJ     3     3     7
2    202001    NY     4     5     8
3    201901    NY     4     5     8
4    201907    NY     4     5     8
5    201908    NY     4     5     8
6    201901    PA     4     5     8
7    201907    PA     4     5     8
8    201908    PA     4     5     8
9    202003    NY     3     3     7
10   202109    PA     3     4     7

Provide sample of your input dataframes using src_df.head(10).to_dict() and hist_df.head(10).to_dict() — not_speshal
– not_speshal, Commented May 7, 2022 at 21:23
@not_speshal updated my post with src_df.to_dict(), hist_df.to_dict(), desired_output.to_dict(). Do you have any ideas how to get my expected output correctly in pandas? thanks — kim
– kim, Commented May 7, 2022 at 21:38
@not_speshal in my actual dataset, both src_df and hist_df has more than 100+ colunms. Can we do this only mapping state, yyyy_mm columns while keeping same logic here? any thoughts? — kim
– kim, Commented May 9, 2022 at 17:31
My answer does not reference any columns other than state and yyyy_mm. Please try the answer before asking follow-up questions. — not_speshal
– not_speshal, Commented May 9, 2022 at 17:51
@not_speshal my apology, I meant if I already took care of max in hist_df, so I tried this instead: new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_df.groupby("state")["yyyy_mm"]))], just remove the max(), but by doing this I got NAN values in columns. Any idea? — kim
– kim, Commented May 9, 2022 at 18:08

not_speshal · Accepted Answer · 2022-05-07 22:00:43Z

You should change your picked_rows DataFrame to only include dates that are greater than the hist_df dates:

#keep only rows that are newer than in hist_df
new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_df.groupby("state")["yyyy_mm"].max()))]

#of the new rows, keep the latest updated values
picked_rows = new_data.loc[new_data.groupby("state")["yyyy_mm"].idxmax()]

#concat to hist_df
output_df = pd.concat([hist_df, picked_rows], ignore_index=True)

>>> output_df
    yyyy_mm state  col1  col2  col3
0    202101    CA     1     1     1
1    202002    NJ     3     3     7
2    202001    NY     4     5     8
3    201901    NY     4     5     8
4    201907    NY     4     5     8
5    201908    NY     4     5     8
6    201901    PA     4     5     8
7    201907    PA     4     5     8
8    201908    PA     4     5     8
9    202003    NY     3     3     7
10   202109    PA     3     4     7

Collectives™ on Stack Overflow

how to compare two dataframes by multiple columns and only append new entries in pandas?

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related