
I have the dataframe below:

   Id   Name    Sales   Rent    Rate
40808   A2      0       43      340
17486   DV      491     0       346
17486   D       0       0       0
27977   AM      0       0       0
27977   A-M     0       0       94
80210   O-9     0       0       0
80210   M-1     0       0       -37
15545   M-2     0       0       -17
15545   O-8     0       0       0
53549   A-M7    0       0       0
53549   A-M8    0       0       50
40808   A       0       0       0
66666   MK      0       0       0
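
For reference, a minimal snippet that rebuilds this sample frame (plain integer dtypes are an assumption; the real data may differ):

import pandas as pd

df = pd.DataFrame({
    'Id':    [40808, 17486, 17486, 27977, 27977, 80210, 80210,
              15545, 15545, 53549, 53549, 40808, 66666],
    'Name':  ['A2', 'DV', 'D', 'AM', 'A-M', 'O-9', 'M-1',
              'M-2', 'O-8', 'A-M7', 'A-M8', 'A', 'MK'],
    'Sales': [0, 491, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Rent':  [43, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    'Rate':  [340, 346, 0, 0, 94, 0, -37, -17, 0, 0, 50, 0, 0],
})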
    

I want to remove duplicate rows based on Id values (e.g. 40808) and keep only the row that doesn't have a 0 value in all the fields. I used the suggestion from the answer:

# count the zeros per row across the numeric columns
# (Id is counted too, but it is never 0, so it only adds a constant)
df['zero'] = df.select_dtypes(['int', 'float']).eq(0).sum(axis=1)
# rows with fewer zeros sort first, so drop_duplicates keeps them
df = df.sort_values(['zero', 'Id']).drop_duplicates(subset=['Id']).drop(columns='zero')

The output I got:

      Id  Name  Sales  Rent  Rate
   40808    A2      0    43   340
   53549  A-M7      0     0     0
   27977    AM      0     0     0
   17486     D      0     0     0
   80210   M-1      0     0   -37
   15545   M-2      0     0   -17
   66666    MK      0     0     0

Expected output:

Id      Name    Sales   Rent    Rate
40808   A2      0       43      340
17486   DV      491     0       346
27977   A-M     0       0       94
80210   M-1     0       0       -37
15545   M-2     0       0       -17
53549   A-M8    0       0       50
66666   MK      0       0       0
2 Comments
  • You want to remove duplicate rows based on Id values, but in your expected output I can see 4567 two times. Also you have "E" in the expected output whereas it wasn't present in the original dataframe. Commented Aug 11, 2021 at 18:45
  • @mozway Nothing works. I updated the question; can you please check the output I got vs. the expected output? Commented Aug 16, 2021 at 9:00

5 Answers


First create a mask to separate duplicate and non-duplicate rows based on Id, then concatenate the non-duplicate slice with the duplicate rows that are not all zero.

>>> duplicateMask = df.duplicated('Id', keep=False)
>>> pd.concat([df.loc[duplicateMask & df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)],
               df[~duplicateMask]])
       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
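
Because the non-zero filter is applied only to the duplicated slice, a non-duplicated all-zero row such as 66666 MK passes through untouched, which is exactly what the expected output requires.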

3 Comments

No, actually I have some rows that are all zeros (but without a duplicate Id). I don't want to drop all the rows that have zeros, only those with a duplicated Id.
If I have 40808 A2 0 43 340 and 40808 A 0 0 0, I want to keep only 40808 A2 0 43 340. And I cannot sort on 0 because I have negative values.
Let me know if the updated solution works for you.

Another way is to count the number of zeros per row, sort the dataframe by that count, drop the duplicate values, and finally remove the helper 'zero' column:

# count zeros across the numeric columns
df['zero'] = df.select_dtypes(['int', 'float']).eq(0).sum(axis=1)
# rows with the fewest zeros sort first and win the deduplication
df = df.sort_values(['zero', 'Id']).drop_duplicates(subset=['Id']).drop(columns='zero')
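
Since the sort key is the count of zeros rather than the raw values, negative entries such as -37 and -17 do not affect the ordering; this sidesteps the concern about negative values raised in the comments below.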

9 Comments

keep='last' will remove my first row
@Prestige No it doesn't, since we sorted the values and the order has changed, so check again (I added the output; the first row is at index position 3)
Thanks, but this is a mini example of my dataframe. I tried this before posting on Stack Overflow and it didn't work; I need to remove the rows based on the conditions Rent & Sales = 0
@Prestige Sort your dataframe and let drop_duplicates drop those rows (if they are duplicated). Also, this is a mini sample, but if there are more numeric columns in your actual dataframe, include those columns in sort_values() too
This is not working because I have negative values; that's why keep='last' is not working for me. When I order the df by Sales and Rent, the negatives come first

The problem with your sample is that once you have removed the rows with all zeros in the columns [Sales, Rent, Rate], there are no more duplicate values.

I want to remove duplicate rows based on Id values (e.g. 40808) and keep only the row that doesn't have a 0 value in all the fields.

You should reverse the logic:

I want to keep only the rows that don't have a 0 value in all the fields and (then) remove duplicate rows based on Id values (e.g. 40808).

>>> df[~df[['Sales', 'Rent', 'Rate']].eq(0).all(axis=1)].drop_duplicates('Id')

       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
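
Note that the non-duplicated all-zero row (66666 MK) is dropped by the first filter as well; the comment below objects to exactly that.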

1 Comment

No, I want to remove duplicate rows based on Id value and keep only one row for the duplicated Id (the rows that I want to keep are the ones with values != 0)

Here is a working solution. It first splits the data in two: the rows we keep for sure, and the whole dataframe in which the keep-for-sure rows are flagged with a NaN in Name. We then drop duplicates in the latter subset so that an all-zero row is kept only when no keep-for-sure row exists for that Id. Finally, we concatenate both subsets after dropping the flagged keep-for-sure rows from the second subset.

cond = df[['Sales', 'Rent', 'Rate']].ne(0).any(axis=1)   # rows to keep for sure

pd.concat([df[cond],
           (df.assign(Name=df['Name'].where(~cond, float('nan')))   # flag keep-for-sure rows with NaN
              .loc[cond.sort_values().index]   # sort so that keep-for-sure rows come last
              .drop_duplicates(subset='Id', keep='last')   # keep an all-zero row only if its Id has no keep-for-sure row
              .dropna(subset=['Name'])   # drop the flagged rows, already present in the first slice
            )
          ])

output:

       Id  Name  Sales  Rent  Rate
0   40808    A2      0    43   340
1   17486    DV    491     0   346
4   27977   A-M      0     0    94
6   80210   M-1      0     0   -37
7   15545   M-2      0     0   -17
10  53549  A-M8      0     0    50
12  66666    MK      0     0     0
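
The same prefer-the-non-zero-row-per-Id idea can also be written as a single sort-then-dedupe pass. This is only a sketch using the question's column names (Id, Sales, Rent, Rate), not part of the original answer:

import pandas as pd

def dedupe_keep_nonzero(df, id_col='Id', value_cols=('Sales', 'Rent', 'Rate')):
    # True for rows whose metric columns are all zero
    all_zero = df[list(value_cols)].eq(0).all(axis=1)
    # within each Id, non-zero rows (False) sort before all-zero rows (True),
    # so drop_duplicates keeps a non-zero row whenever the group has one
    return (df.assign(_all_zero=all_zero)
              .sort_values([id_col, '_all_zero'])
              .drop_duplicates(subset=id_col)
              .drop(columns='_all_zero'))

This returns the same seven rows, ordered by Id rather than by original position.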



This works, proceeding in 2 steps:

# Step 1 - collect all rows that are *not* duplicates (based on Id)
non_duplicates_to_keep = df.drop_duplicates(subset='Id', keep=False)

# Step 2a - identify *all* rows that have duplicates (based on Id, keep all)
sub_df = df[df.duplicated('Id', keep=False)]

# Step 2b - of those duplicates, discard the ones that have 0 in *all* numeric columns (Id excluded)
value_cols = sub_df.select_dtypes('number').columns[1:]
duplicates_to_keep = sub_df[sub_df[value_cols].ne(0).any(axis=1)]

# join the 2 sets
pd.concat([non_duplicates_to_keep, duplicates_to_keep])

Beware, though: what you are asking for can again lead to duplicates (as per your question). If you have duplicates (imagine 4 rows with the same Id, of which 2 have non-zero values), you will end up with a duplicated Id again after cleaning: 2 rows get removed because they are all zero, and 2 remain as they are non-zero.

That is not the case in your dummy data, but it happens in general... Is this really what you are after?
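
If at most one row per Id is wanted even in that case, a final dedupe pass can be appended; note that which of several non-zero rows "wins" is an assumption, as the question does not specify a tie-break:

# keep the first surviving row per Id (tie-break: original order)
pd.concat([non_duplicates_to_keep, duplicates_to_keep]).drop_duplicates(subset='Id')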

