
I have the dataframe below:

   Id   Name    Sales   Rent    Rate
40808   A2      0       43      340
17486   DV      491     0       346
17486   D       0       0       0
27977   AM      0       0       0
27977   A-M     0       0       94
80210   O-9     0       0       0
80210   M-1     0       0       -37
15545   M-2     0       0       -17
15545   O-8     0       0       0
53549   A-M7    0       0       0
53549   A-M8    0       0       50
40808   A       0       0       0
66666   MK      0       0       0
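
For reference, a minimal snippet that rebuilds this sample frame (plain integer dtypes are an assumption; the real data may differ):

import pandas as pd

df = pd.DataFrame({
    'Id':    [40808, 17486, 17486, 27977, 27977, 80210, 80210,
              15545, 15545, 53549, 53549, 40808, 66666],
    'Name':  ['A2', 'DV', 'D', 'AM', 'A-M', 'O-9', 'M-1',
              'M-2', 'O-8', 'A-M7', 'A-M8', 'A', 'MK'],
    'Sales': [0, 491, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Rent':  [43, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Rate':  [340, 346, 0, 0, 94, 0, -37, -17, 0, 0, 50, 0, 0],
})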
    

I want to remove duplicate rows based on Id values (e.g. 40808) and keep only the row that doesn't have a 0 value in all the fields. I used the suggestion from the answer:

# count the zeros per row across the numeric columns
# (Id is counted too, but it is never 0, so it only adds a constant)
df['zero'] = df.select_dtypes(['int', 'float']).eq(0).sum(axis=1)
# rows with fewer zeros sort first, so drop_duplicates keeps them
df = df.sort_values(['zero', 'Id']).drop_duplicates(subset=['Id']).drop(columns='zero')

The output I got:

      Id  Name  Sales  Rent  Rate
   40808    A2      0    43   340
   53549  A-M7      0     0     0
   27977    AM      0     0     0
   17486     D      0     0     0
   80210   M-1      0     0   -37
   15545   M-2      0     0   -17
   66666    MK      0     0     0

Expected output:

Id      Name    Sales   Rent    Rate
40808   A2      0       43      340
17486   DV      491     0       346
27977   A-M     0       0       94
80210   M-1     0       0       -37
15545   M-2     0       0       -17
53549   A-M8    0       0       50
66666   MK      0       0       0
2 Comments
  • You want to remove duplicate rows based on Id values, but in your expected output I can see 4567 two times. Also you have "E" in the expected output whereas it wasn't present in the original dataframe. Commented Aug 11, 2021 at 18:45
  • @mozway Nothing works. I updated the question; can you please check the output I got vs. the expected output? Commented Aug 16, 2021 at 9:00

5 Answers


First create a mask to separate duplicate and non-duplicate rows based on Id, then concatenate the non-duplicate slice with the duplicate rows that are not all zero.

>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
               df[~duplicateMask]])
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
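
Because the non-zero filter is applied only to the duplicated slice, a non-duplicated all-zero row such as 66666 MK passes through untouched, which is exactly what the expected output requires.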

3 Comments

No, actually I have some rows that are all zeros (but without a duplicate Id). I don't want to drop all the rows that have zeros, only those with a duplicated Id.
If I have 40808 A2 0 43 340 and 40808 A 0 0 0, I want to keep only 40808 A2 0 43 340. And I cannot sort on 0 because I have negative values.
Let me know if the updated solution works for you.

Another way is to count the number of zeros per row, sort the dataframe by that count, drop the duplicate values, and finally remove the helper 'zero' column:

# count zeros across the numeric columns
df['zero'] = df.select_dtypes(['int', 'float']).eq(0).sum(axis=1)
# rows with the fewest zeros sort first and win the deduplication
df = df.sort_values(['zero', 'Id']).drop_duplicates(subset=['Id']).drop(columns='zero')
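
Since the sort key is the count of zeros rather than the raw values, negative entries such as -37 and -17 do not affect the ordering; this sidesteps the concern about negative values raised in the comments below.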

9 Comments

keep='last' will remove my first row
@Prestige No it doesn't, since we sorted the values and the order has changed, so check again (I added the output; the first row is at index position 3)
Thanks, but this is a mini example of my dataframe. I tried this before posting on Stack Overflow and it didn't work; I need to remove the rows based on the conditions Rent & Sales = 0
@Prestige Sort your dataframe and let drop_duplicates drop those rows (if they are duplicated). Also, this is a mini sample, but if there are more numeric columns in your actual dataframe, include those columns in sort_values() too
This is not working because I have negative values; that's why keep='last' is not working for me. When I order the df by Sales and Rent, the negatives come first

The problem with your sample is that once you have removed the rows with all zeros in the columns [Sales, Rent, Rate], there are no more duplicate values.

I want to remove duplicate rows based on Id values (e.g. 40808) and keep only the row that doesn't have a 0 value in all the fields.

You should reverse the logic:

I want to keep only the rows that don't have a 0 value in all the fields and (then) remove duplicate rows based on Id values (e.g. 40808).

>>> df[~df[['Sales', 'Rent', 'Rate']].eq(0).all(axis=1)].drop_duplicates('Id')

       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
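
Note that the non-duplicated all-zero row (66666 MK) is dropped by the first filter as well; the comment below objects to exactly that.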

1 Comment

No, I want to remove duplicate rows based on Id value and keep only one row for the duplicated Id (the rows that I want to keep are the ones with values != 0)

Here is a working solution. It first splits the data in two: the rows we keep for sure, and the whole dataframe in which the keep-for-sure rows are flagged with a NaN in Name. We then drop duplicates in the latter subset so that an all-zero row is kept only when no keep-for-sure row exists for that Id. Finally, we concatenate both subsets after dropping the flagged keep-for-sure rows from the second subset.

cond = df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)   # rows to keep for sure

pd.concat([df[cond],
           (df.assign(Name=df['Name'].where(~cond, float('nan')))   # flag keep-for-sure rows with NaN
              .loc[cond.sort_values().index]   # sort so that keep-for-sure rows come last
              .drop_duplicates(subset='Id', keep='last')   # keep an all-zero row only if its Id has no keep-for-sure row
              .dropna(subset=['Name'])   # drop the flagged rows, already present in the first slice
            )
          ])

output:

       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
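
The same prefer-the-non-zero-row-per-Id idea can also be written as a single sort-then-dedupe pass. This is only a sketch using the question's column names (Id, Sales, Rent, Rate), not part of the original answer:

import pandas as pd

def dedupe_keep_nonzero(df, id_col='Id', value_cols=('Sales', 'Rent', 'Rate')):
    # True for rows whose metric columns are all zero
    all_zero = df[list(value_cols)].eq(0).all(axis=1)
    # within each Id, non-zero rows (False) sort before all-zero rows (True),
    # so drop_duplicates keeps a non-zero row whenever the group has one
    return (df.assign(_all_zero=all_zero)
              .sort_values([id_col, '_all_zero'])
              .drop_duplicates(subset=id_col)
              .drop(columns='_all_zero'))

This returns the same seven rows, ordered by Id rather than by original position.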



This works, proceeding in 2 steps:

# Step 1 - collect all rows that are *not* duplicates (based on Id)
non_duplicates_to_keep = df.drop_duplicates(subset='Id', keep=False)

# Step 2a - identify *all* rows that have duplicates (based on Id, keep all)
sub_df = df[df.duplicated('Id', keep=False)]

# Step 2b - of those duplicates, discard the ones that have 0 in *all* numeric columns (Id excluded)
value_cols = sub_df.select_dtypes('number').columns[1:]
duplicates_to_keep = sub_df[sub_df[value_cols].ne(0).any(axis=1)]

# join the 2 sets
pd.concat([non_duplicates_to_keep, duplicates_to_keep])

Beware, though: what you are asking for can again lead to duplicates (as per your question). If you have duplicates (imagine 4 rows with the same Id, of which 2 have non-zero values), you will end up with a duplicated Id again after cleaning: 2 rows get removed because they are all zero, and 2 remain as they are non-zero.

That is not the case in your dummy data, but it happens in general... Is this really what you are after?
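
If at most one row per Id is wanted even in that case, a final dedupe pass can be appended; note that which of several non-zero rows "wins" is an assumption, as the question does not specify a tie-break:

# keep the first surviving row per Id (tie-break: original order)
pd.concat([non_duplicates_to_keep, duplicates_to_keep]).drop_duplicates(subset='Id')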

