I have a dataset:

id    url     keep_anyway  field
1     A.com   Yes          X
2     A.com   Yes          Y
3     B.com   No           Y
4     B.com   No           X
5     C.com   No           X 

I want to remove rows with a duplicated "url", subject to two conditions:

  1. Keep duplicates if "keep_anyway" = "Yes".
  2. For duplicates with "keep_anyway" = "No", keep only the row with the value "X" in the "field" column.

The expected output is:

id    url     keep_anyway  field
1     A.com   Yes          X
2     A.com   Yes          Y
4     B.com   No           X
5     C.com   No           X 

I have been able to handle condition 1 with:

df.loc[(df['keep_anyway'] == 'Yes') | ~df['url'].duplicated()]

But how do I set up condition 2?

Note that the only possible values of the "field" column are X and Y, and whenever a url is duplicated, I know for sure that there is exactly one "X" row and one "Y" row.

I thought I could sort the "field" column from A to Z and then use "keep_first"=True in df.duplicated, but I think it is deprecated, isn't it?
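
For reference, the sorting idea can be sketched as follows. This is a minimal sketch on the sample data above, not a definitive implementation. As far as I know, duplicated has no "keep_first" argument. The parameter is keep, which accepts 'first', 'last' or False, and 'first' is the default. The names sorted_df, mask and result are introduced here for illustration only.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_anyway": ["Yes", "Yes", "No", "No", "No"],
    "field": ["X", "Y", "Y", "X", "X"],
})

# Sort so that "X" rows come before "Y" rows
sorted_df = df.sort_values("field")

# After sorting, the first occurrence of each url is its "X" row (if it has one),
# so only the later "Y" occurrence is marked as a duplicate
mask = (sorted_df["keep_anyway"] == "Yes") | ~sorted_df["url"].duplicated(keep="first")

# Apply the mask and restore the original row order
result = sorted_df.loc[mask].sort_index()
print(result)
```

This relies on "X" sorting before "Y"; with other values in "field" the sort order would need to be adapted.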

1 Answer

Try this:

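import numpy as np

# keep=False marks every row of a duplicated url, not only the later occurrences
duplicates = df.duplicated(subset='url', keep=False)
keep_anyway_bool = df['keep_anyway'] == 'Yes'  # (credit @acushner for pointing this out)
field_bool = df['field'] == 'X'  # (credit @acushner for pointing this out)

# keep rows that are unique, flagged keep_anyway, or have field == 'X'
df[np.invert(duplicates) | keep_anyway_bool | field_bool]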
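
To check the mask approach against the sample data from the question, here is a small self-contained sketch. It assumes duplicated is called with keep=False, so that both rows of a duplicated url are flagged and the "Y" row of B.com is not kept merely because it appears first. The names in_duplicate_group and result are illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "url": ["A.com", "A.com", "B.com", "B.com", "C.com"],
    "keep_anyway": ["Yes", "Yes", "No", "No", "No"],
    "field": ["X", "Y", "Y", "X", "X"],
})

# keep=False flags every row whose url occurs more than once
in_duplicate_group = df.duplicated(subset="url", keep=False)
keep_anyway_bool = df["keep_anyway"] == "Yes"
field_bool = df["field"] == "X"

# Keep a row if its url is unique, or it is marked keep_anyway, or its field is "X"
result = df[~in_duplicate_group | keep_anyway_bool | field_bool]
print(result)
```

On this data the result contains ids 1, 2, 4 and 5, which matches the expected output in the question.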

4 Comments

do you need the np.where there? can't you just do field_bool = df.field == 'X'?
Yes, I was initially trying to get everything in the same line, hence the np.where. I was going to nest them, then realized that the conditions need to be 'or'ed together...
NameError: name 'np' is not defined. Why is that?
import numpy as np should cure that. Just use ~duplicates instead...
