I have a dataset:
id url keep_anyway field
1 A.com Yes X
2 A.com Yes Y
3 B.com No Y
4 B.com No X
5 C.com No X
I want to remove rows with a duplicated "url", subject to two conditions:
- Keep the duplicates if "keep_anyway" = "Yes".
- For duplicates with "keep_anyway" = "No", keep only the row whose "field" value is "X".
The expected output is:
id url keep_anyway field
1 A.com Yes X
2 A.com Yes Y
4 B.com No X
5 C.com No X
I have been able to handle condition 1 with:
df.loc[(df['keep_anyway'] == 'Yes') | ~df['url'].duplicated()]
But how do I set up condition 2?
Note that the only possible values in the "field" column are X and Y, and whenever a "url" is duplicated, I know for sure that there is exactly one "X" row and one "Y" row.
I thought I could sort the "field" column from A to Z and then use something like "keep_first"=True in df.duplicated, but I think that is deprecated, isn't it?
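For reference, the sort-then-deduplicate idea might look roughly like the sketch below. It rebuilds the sample data as df and relies on "X" sorting before "Y". The names ordered, mask and result are only illustrative. As far as I can tell, the current argument of duplicated is keep, which accepts 'first', 'last' or False, with 'first' as the default.

```python
import pandas as pd

# Sample data copied from the table in the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_anyway": ["Yes", "Yes", "No", "No", "No"],
    "field": ["X", "Y", "Y", "X", "X"],
})

# Put "X" rows ahead of "Y" rows so that, for each url,
# the first occurrence seen by duplicated() is the "X" row.
ordered = df.sort_values("field")

# Keep a row if it is flagged "Yes", or if it is the first row for its url.
mask = (ordered["keep_anyway"] == "Yes") | ~ordered["url"].duplicated(keep="first")

# Restore the original id order for readability.
result = ordered[mask].sort_values("id").reset_index(drop=True)
print(result)
```

This only works because of the guarantee that each duplicated "url" has exactly one "X" and one "Y" row. If "field" could take other values, the sort order would need to be defined explicitly.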