3

I have a data frame that looks something like:

df =

date         col1    col2    col3    col4
-----------------------------------------
2022/30/01   2       2       4       5
2022/30/01   2       2       4       5
2022/30/01   0       0       1       2
2022/30/01   0       0       1       2
2022/30/01   3       2       4       2
2022/30/01   5       8       4       3

So basically I have the first two rows being identical, the next two rows also being identical, and the last two different.

What I would like to do is to remove duplicate rows, but only those rows where col1 and col2 equals 0, i.e. the resulting data frame should be:

df_final =

date         col1    col2    col3    col4
-----------------------------------------
2022/30/01   2       2       4       5
2022/30/01   2       2       4       5
2022/30/01   0       0       1       2
2022/30/01   3       2       4       2
2022/30/01   5       8       4       3

Is there any way to accomplish this in an easy way ? I know I could probably do some kind of sorting on the data frame, and then loop through each row and check for conditions. I just suspect that could be a rather time consuming process if there are a lot of rows.

3 Answers 3

5

IIUC, this is a simple selection by boolean masks using duplicated to find the duplicated rows and ne+all to filter the 0 values:

# is the row not a duplicate?
mask1 = ~df.duplicated()
# are col1 and col2 not both 0?
mask2 = df[['col1', 'col2']].ne(0).all(axis=1)
# then keep the data on either of the above conditions
df2 = df[mask1|mask2]

output:

         date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Alternative with the reverse operations (eq+all)

# Is the row duplicated?
m1 = df.duplicated()
# Are both cols equal to 0?
m2 = df[['col1', 'col2']].eq(0).all(1)
# then keep if not both conditions are met
df[~(m1&m2)]
Sign up to request clarification or add additional context in comments.

Comments

4

You can filter in boolean indexing by chain 2 conditions - find rows without 0 rows and then add first duplicated rows by DataFrame.duplicated, by default are tested all columns, for chain is used | for bitwise OR:

df_final = df[df[['col1','col2']].ne(0).all(axis=1) | ~df.duplicated()]
print (df_final)
         date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Details:

print (df[['col1','col2']].ne(0).all(axis=1))
0     True
1     True
2    False
3    False
4     True
5     True
dtype: bool

print (~df.duplicated())
0     True
1    False
2     True
3    False
4     True
5     True
dtype: bool

1 Comment

I think you should use ~df.duplicated(), if you have more that 2 duplicates this will keep them
0
df1.drop(index=df1.query("col1==0 and col2==0")
         .duplicated().loc[lambda x:x].index)

      date  col1  col2  col3  col4
0  2022/30/01     2     2     4     5
1  2022/30/01     2     2     4     5
2  2022/30/01     0     0     1     2
4  2022/30/01     3     2     4     2
5  2022/30/01     5     8     4     3

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.