Remove duplicate rows but with condition

Question

I have a data frame that looks something like:

df =

date         col1    col2    col3    col4
-----------------------------------------
2022/30/01   2       2       4       5
2022/30/01   2       2       4       5
2022/30/01   0       0       1       2
2022/30/01   0       0       1       2
2022/30/01   3       2       4       2
2022/30/01   5       8       4       3

So basically I have the first two rows being identical, the next two rows also being identical, and the last two different.

What I would like to do is to remove duplicate rows, but only those rows where col1 and col2 equals 0, i.e. the resulting data frame should be:

df_final =

date         col1    col2    col3    col4
-----------------------------------------
2022/30/01   2       2       4       5
2022/30/01   2       2       4       5
2022/30/01   0       0       1       2
2022/30/01   3       2       4       2
2022/30/01   5       8       4       3

Is there any way to accomplish this in an easy way ? I know I could probably do some kind of sorting on the data frame, and then loop through each row and check for conditions. I just suspect that could be a rather time consuming process if there are a lot of rows.

mozway · Accepted Answer · 2022-02-14 11:54:28Z

5

IIUC, this is a simple selection by boolean masks using duplicated to find the duplicated rows and ne+all to filter the 0 values:

# is the row not a duplicate?
mask1 = ~df.duplicated()
# are col1 and col2 not both 0?
mask2 = df[['col1', 'col2']].ne(0).all(axis=1)
# then keep the data on either of the above conditions
df2 = df[mask1|mask2]

output:

         date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Alternative with the reverse operations (eq+all)

# Is the row duplicated?
m1 = df.duplicated()
# Are both cols equal to 0?
m2 = df[['col1', 'col2']].eq(0).all(1)
# then keep if not both conditions are met
df[~(m1&m2)]

edited Feb 14, 2022 at 11:54

answered Feb 14, 2022 at 11:46

mozway

267k13 gold badges56 silver badges106 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

jezrael · Accepted Answer · 2022-02-14 11:50:48Z

4

You can filter in boolean indexing by chain 2 conditions - find rows without 0 rows and then add first duplicated rows by DataFrame.duplicated, by default are tested all columns, for chain is used | for bitwise OR:

df_final = df[df[['col1','col2']].ne(0).all(axis=1) | ~df.duplicated()]
print (df_final)
         date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Details:

print (df[['col1','col2']].ne(0).all(axis=1))
0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool

print (~df.duplicated())
0     True
1    False
2     True
3    False
4     True
5     True
dtype: bool

edited Feb 14, 2022 at 11:50

answered Feb 14, 2022 at 11:45

jezrael

868k103 gold badges1.4k silver badges1.3k bronze badges

1 Comment

mozway Over a year ago

I think you should use ~df.duplicated(), if you have more that 2 duplicates this will keep them

G.G · Accepted Answer · 2022-11-10 07:59:14Z

0

df1.drop(index=df1.query("col1==0 and col2==0")
         .duplicated().loc[lambda x:x].index)

      date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

answered Nov 10, 2022 at 7:59

G.G

7654 silver badges5 bronze badges

Collectives™ on Stack Overflow

Remove duplicate rows but with condition

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related