Pandas: in case of duplicate values, remove the row with a particular value in another column

Question

I have a dataset:

id    url     keep_anyway  field
1     A.com   Yes          X
2     A.com   Yes          Y
3     B.com   No           Y
4     B.com   No           X
5     C.com   No           X

I want to remove "url" duplicates under two conditions:

1. Keep duplicates if "keep_anyway" = "Yes".
2. For duplicates with "keep_anyway" = "No", keep only the row with the "X" value in the "field" column.

The expected output is:

id    url     keep_anyway  field
1     A.com   Yes          X
2     A.com   Yes          Y
4     B.com   No           X
5     C.com   No           X

I have been able to handle condition 1 with:

df.loc[(df['keep_anyway'] == 'Yes') | ~df['url'].duplicated()]

But how do I set up condition 2?

Note that the possible values of the "field" column are either X or Y, and if I have duplicates, I know FOR SURE that I have one "X" and one "Y" value.

I thought maybe I could sort the "field" column from A to Z and then use "keep_first"=True in df.duplicated, but I think it is deprecated, isn't it?
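
The sort-then-keep-first idea can be sketched roughly as follows. This is only an illustration, assuming the column names from the sample data above; the names yes_rows, no_rows and deduped are made up for the example. As far as I know there is no keep_first argument: the older take_last argument was deprecated, and duplicated / drop_duplicates take keep='first' (the default), keep='last' or keep=False.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_anyway": ["Yes", "Yes", "No", "No", "No"],
    "field": ["X", "Y", "Y", "X", "X"],
})

# Rows flagged "Yes" are kept untouched.
yes_rows = df[df["keep_anyway"] == "Yes"]

# For the "No" rows, sort so that "X" comes before "Y",
# then keep the first row seen for each url.
no_rows = (
    df[df["keep_anyway"] == "No"]
    .sort_values("field")
    .drop_duplicates(subset="url", keep="first")
)

# Put both parts back together in the original id order.
deduped = pd.concat([yes_rows, no_rows]).sort_values("id")
print(deduped)
```

This relies on "X" sorting before "Y", which holds for the two values described here but would need rethinking for other labels.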

Vincent · Accepted Answer · 2016-08-10 08:09:06Z

Try this:

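import numpy as np

# keep=False flags every row of a repeated url, not only the later copies
duplicates = df.duplicated(subset='url', keep=False)
keep_anyway_bool = df['keep_anyway'] == 'Yes'  # (credit @acushner for pointing this out)
field_bool = df['field'] == 'X'  # (credit @acushner for pointing this out)

# keep rows whose url is unique, or that are flagged "Yes", or whose field is "X"
df[np.invert(duplicates) | keep_anyway_bool | field_bool]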
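
For reference, here is a self-contained sketch of the boolean-mask approach run on the sample data from the question. It assumes the column names shown there and passes keep=False to duplicated so that every copy of a repeated url is flagged, not only the later ones; the names is_dup, keep_anyway, field_is_x and result are just illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_anyway": ["Yes", "Yes", "No", "No", "No"],
    "field": ["X", "Y", "Y", "X", "X"],
})

# True for every row whose url occurs more than once.
is_dup = df.duplicated(subset="url", keep=False)
keep_anyway = df["keep_anyway"] == "Yes"
field_is_x = df["field"] == "X"

# Keep a row if its url is unique, or it is flagged "Yes", or its field is "X".
result = df[~is_dup | keep_anyway | field_is_x]
print(result["id"].tolist())  # [1, 2, 4, 5]
```

With the default keep='first', the first B.com row (id 3) would not be flagged as a duplicate and would survive the filter, which is why keep=False is used in this sketch.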

edited Aug 10, 2016 at 8:09

Vincent

answered Aug 9, 2016 at 22:22

Kartik

4 Comments

acushner Over a year ago

do you need the np.where there? can't you just do field_bool = df.field == 'X'?

Kartik Over a year ago

Yes, I was initially trying to get everything in the same line, hence the np.where. I was going to nest them, then realized that the conditions need to be 'or'ed together...

Vincent Over a year ago

NameError: name 'np' is not defined. Why is that?

Kartik Over a year ago

import numpy as np should cure that. Or just use ~duplicates instead of np.invert(duplicates)...
