1

I want to add new files to historical table (both are in csv format and they are not in db), before that, I need to check new file with historical table by comparing its two column in particular, one is state and another one is date column. First, I need to check max (state, yyyy_mm), then check those entries with max(state, yyyy_mm) in historical table; if they are not historical table, then append them, otherwise do nothing.

So far I am able to pick the rows with max (state, yyyy_mm), but when I tried to compare those picked rows with historical table, I am not getting expected output. I tried pandas.merge, pandas.concat but output is not same with my expected output. Can anyone point me out how to do this in pandas? Any thoughts?

Input data:

>>> src_df.to_dict()   
{'yyyy_mm': {0: 202001,
  1: 202002,
  2: 202003,
  3: 202002,
  4: 202107,
  5: 202108,
  6: 202109},
 'state': {0: 'CA', 1: 'NJ', 2: 'NY', 3: 'NY', 4: 'PA', 5: 'PA', 6: 'PA'},
 'col1': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 3},
 'col2': {0: 3, 1: 3, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4},
 'col3': {0: 7, 1: 7, 2: 7, 3: 7, 4: 7, 5: 7, 6: 7}}

>>> hist_df.to_dict()
{'yyyy_mm': {0: 202101,
  1: 202002,
  2: 202001,
  3: 201901,
  4: 201907,
  5: 201908,
  6: 201901,
  7: 201907,
  8: 201908},
 'state': {0: 'CA',
  1: 'NJ',
  2: 'NY',
  3: 'NY',
  4: 'NY',
  5: 'NY',
  6: 'PA',
  7: 'PA',
  8: 'PA'},
 'col1': {0: 1, 1: 3, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4, 8: 4},
 'col2': {0: 1, 1: 3, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5, 7: 5, 8: 5},
 'col3': {0: 1, 1: 7, 2: 8, 3: 8, 4: 8, 5: 8, 6: 8, 7: 8, 8: 8}}

My current attempt:

picked_rows = src_df.loc[src_df.groupby('state')['yyyy_mm'].idxmax()]

>>> picked_rows.to_dict()
{'yyyy_mm': {0: 202001, 1: 202002, 2: 202003, 6: 202109},
 'state': {0: 'CA', 1: 'NJ', 2: 'NY', 6: 'PA'},
 'col1': {0: 3, 1: 3, 2: 3, 6: 3},
 'col2': {0: 3, 1: 3, 2: 3, 6: 4},
 'col3': {0: 7, 1: 7, 2: 7, 6: 7}}

Then I tried to do following but output is not same as my expected output:

output_df = pd.concat(picked_rows, hist_df, keys=['state', 'yyyy_mm'], axis=1)  # first attempt
output_df = pd.merge(picked_rows, hist_df, how='outer')   # second attempt

but both of those attempt not giving me my expected output. How should I get my desired output by comparing two dataframes where picked_rows should be append to hist_df by conditionally such as max('state', 'yyyy_mm'). How should we do this in pandas?

objective

I want to check picked_rows in hist_df where I need to check by state and yyyy_mm columns, so only add entries from picked_rows where state has max value or recent dates. I created desired output below. I tried inner join or pandas.concat but it is not giving me correct out. Does anyone have any ideas on this?

Here is my desired output that I want to get:

    yyyy_mm state  col1  col2  col3
0    202101    CA     1     1     1
1    202002    NJ     3     3     7
2    202001    NY     4     5     8
3    201901    NY     4     5     8
4    201907    NY     4     5     8
5    201908    NY     4     5     8
6    201901    PA     4     5     8
7    201907    PA     4     5     8
8    201908    PA     4     5     8
9    202003    NY     3     3     7
10   202109    PA     3     4     7
7
  • 1
    Provide sample of your input dataframes using src_df.head(10).to_dict() and hist_df.head(10).to_dict() Commented May 7, 2022 at 21:23
  • @not_speshal updated my post with src_df.to_dict(), hist_df.to_dict(), desired_output.to_dict(). Do you have any ideas how to get my expected output correctly in pandas? thanks Commented May 7, 2022 at 21:38
  • @not_speshal in my actual dataset, both src_df and hist_df has more than 100+ colunms. Can we do this only mapping state, yyyy_mm columns while keeping same logic here? any thoughts? Commented May 9, 2022 at 17:31
  • My answer does not reference any columns other than state and yyyy_mm. Please try the answer before asking follow-up questions. Commented May 9, 2022 at 17:51
  • @not_speshal my apology, I meant if I already took care of max in hist_df, so I tried this instead: new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_df.groupby("state")["yyyy_mm"]))], just remove the max(), but by doing this I got NAN values in columns. Any idea? Commented May 9, 2022 at 18:08

1 Answer 1

2

You should change your picked_rows DataFrame to only include dates that are greater than the hist_df dates:

#keep only rows that are newer than in hist_df
new_data = src_df[src_df["yyyy_mm"].gt(src_df["state"].map(hist_df.groupby("state")["yyyy_mm"].max()))]

#of the new rows, keep the latest updated values
picked_rows = new_data.loc[new_data.groupby("state")["yyyy_mm"].idxmax()]

#concat to hist_df
output_df = pd.concat([hist_df, picked_rows], ignore_index=True)

>>> output_df
    yyyy_mm state  col1  col2  col3
0    202101    CA     1     1     1
1    202002    NJ     3     3     7
2    202001    NY     4     5     8
3    201901    NY     4     5     8
4    201907    NY     4     5     8
5    201908    NY     4     5     8
6    201901    PA     4     5     8
7    201907    PA     4     5     8
8    201908    PA     4     5     8
9    202003    NY     3     3     7
10   202109    PA     3     4     7
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.