Removing Duplicate Rows based on two columns

Question

I have a dataframe that contains duplicate values across the following columns (SFDC_ID, left_side, right_SFDC_ID, right_side, and similairity):

Right now SFDC_ID and right_SFDC_ID are duplicating in the following way:

SFDC_ID             left_side             right_SFDC_ID       right_side            similairity
0013s00000vEVuwAAG  Hague Quality Water   0013s00000vEW72AAG  Hague Quality Waters  0.99023304
0013s00000vEW72AAG  Hague Quality Waters  0013s00000vEVuwAAG  Hague Quality Water   0.99023304

If you look closely, the SFDC_ID of row 1 is the same as right_SFDC_ID of row 2.

How would I drop the second row using pandas?

I'd suggest you format your data a bit better, because at the moment one can't tell whether Hague Quality Waters is a column on its own or combined with 0013...
– sammywemmy, Commented Feb 27, 2020 at 22:54

Sameeresque · Accepted Answer · 2020-02-28 02:48:52Z

2

Here's one way:

# Element-wise string comparison (lexicographic order); yields a boolean Series
mask = df['SFDC_ID'] < df['right_SFDC_ID']

# col1 holds the smaller ID of each pair: where mask is True the value from
# df['SFDC_ID'] is kept, otherwise the value from df['right_SFDC_ID'] is used
df['col1'] = df['SFDC_ID'].where(mask, df['right_SFDC_ID'])

# col2 holds the larger ID of each pair (same logic with the columns swapped)
df['col2'] = df['right_SFDC_ID'].where(mask, df['SFDC_ID'])

# Rows with the same (col1, col2) pair are duplicates; keep the first occurrence
df = df.drop_duplicates(subset=['col1', 'col2'])
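
For a self-contained check, here is a minimal sketch of the same idea applied to the two rows from the question. The function name drop_mirrored_pairs and its parameters are hypothetical, introduced only for illustration; the sketch also builds the ordered key in a temporary frame rather than adding col1 and col2 to df, which is a small variation on the code above.

```python
import pandas as pd


def drop_mirrored_pairs(df, left="SFDC_ID", right="right_SFDC_ID"):
    """Keep the first row of each unordered (left, right) ID pair.

    Hypothetical helper: the name and signature are assumptions.
    """
    # True where the left ID sorts before the right ID
    mask = df[left] < df[right]
    # Temporary key with the two IDs in a fixed (smaller, larger) order
    key = pd.DataFrame({
        "col1": df[left].where(mask, df[right]),
        "col2": df[right].where(mask, df[left]),
    })
    # duplicated() marks every repeat after the first occurrence
    return df[~key.duplicated()]


# The two rows shown in the question
df = pd.DataFrame({
    "SFDC_ID": ["0013s00000vEVuwAAG", "0013s00000vEW72AAG"],
    "left_side": ["Hague Quality Water", "Hague Quality Waters"],
    "right_SFDC_ID": ["0013s00000vEW72AAG", "0013s00000vEVuwAAG"],
    "right_side": ["Hague Quality Waters", "Hague Quality Water"],
    "similairity": [0.99023304, 0.99023304],
})

result = drop_mirrored_pairs(df)
print(result)
```

Only the first row should remain, and the original columns are left unchanged because no helper columns are written back to df.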

edited Feb 28, 2020 at 2:48

answered Feb 27, 2020 at 23:04


6 Comments

Matthew Metros Over a year ago

Could you explain what exactly this does?

Matthew Metros Over a year ago

What exactly does matches_df['SFDC_ID'] < matches_df['right_SFDC_ID'] do?

Sameeresque Over a year ago

Did you find my commented answer helpful?

Matthew Metros Over a year ago

It's not working for me. I am still seeing the same issue.

Sameeresque Over a year ago

Based on the example you've given, I understand that you want the second row removed because its 'SFDC_ID' and 'right_SFDC_ID' values are the same as the first row's, just swapped. The lines I've posted above do exactly that: the code checks whether the pair of values in those two columns repeats (in any order) and keeps only the first occurrence.
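
To make the comparison step concrete, here is a small sketch that prints the intermediate values for the two IDs from the question. It assumes both ID columns hold plain strings, so < compares them lexicographically, row by row.

```python
import pandas as pd

# Only the two ID columns from the question's example
df = pd.DataFrame({
    "SFDC_ID": ["0013s00000vEVuwAAG", "0013s00000vEW72AAG"],
    "right_SFDC_ID": ["0013s00000vEW72AAG", "0013s00000vEVuwAAG"],
})

# Row-by-row string comparison: the IDs first differ at 'V' vs 'W'
mask = df["SFDC_ID"] < df["right_SFDC_ID"]
print(mask.tolist())  # [True, False]

# where() keeps the caller's value when mask is True, else takes the other column
col1 = df["SFDC_ID"].where(mask, df["right_SFDC_ID"])
col2 = df["right_SFDC_ID"].where(mask, df["SFDC_ID"])

# Both rows now carry the same ordered pair, so drop_duplicates can match them
pairs = list(zip(col1, col2))
print(pairs[0] == pairs[1])  # True
```

After this step the order in which the two IDs originally appeared no longer matters, which is why a plain drop_duplicates on the two helper columns removes the mirrored row.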


Ali · Accepted Answer · 2020-02-27 23:16:48Z

0

You could iterate over the rows and drop each row whose right_SFDC_ID matches the previous row's SFDC_ID:

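# Collect the labels first and drop once, so positions do not shift mid-loop
to_drop = []
for pos in range(1, len(df)):
    prev_sfdc_id = df.iloc[pos - 1]['SFDC_ID']  # SFDC_ID of the previous row
    if df.iloc[pos]['right_SFDC_ID'] == prev_sfdc_id:
        to_drop.append(df.index[pos])
df = df.drop(index=to_drop)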
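
One design point worth checking: a loop like this only compares each row with the row immediately before it. The sketch below uses made-up single-letter IDs (not from the question) to show a mirrored pair separated by another row, and compares the adjacent-row check with an order-independent key.

```python
import pandas as pd

# Illustrative IDs (made up): rows 0 and 2 mirror each other but are not adjacent
df = pd.DataFrame({
    "SFDC_ID":       ["A", "C", "B"],
    "right_SFDC_ID": ["B", "D", "A"],
})

# Adjacent-row check: compare each right_SFDC_ID with the previous row's SFDC_ID
prev_ids = df["SFDC_ID"].shift(1)
adjacent_hits = df["right_SFDC_ID"] == prev_ids

# Order-independent check: put each pair in (smaller, larger) order, flag repeats
mask = df["SFDC_ID"] < df["right_SFDC_ID"]
key = pd.DataFrame({
    "lo": df["SFDC_ID"].where(mask, df["right_SFDC_ID"]),
    "hi": df["right_SFDC_ID"].where(mask, df["SFDC_ID"]),
})
pair_hits = key.duplicated()

print(adjacent_hits.tolist())  # [False, False, False]
print(pair_hits.tolist())      # [False, False, True]
```

If the mirrored rows in your data always sit next to each other, the adjacent-row check is enough; otherwise an order-independent key is the safer choice.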

answered Feb 27, 2020 at 23:16
