
I have a dataframe that contains duplicate rows across five columns (SFDC_ID, left_side, right_SFDC_ID, right_side, and similairity).

Right now, each pair of SFDC_ID and right_SFDC_ID appears twice, once in each order:

SFDC_ID             left_side             right_SFDC_ID       right_side            similairity
0013s00000vEVuwAAG  Hague Quality Water   0013s00000vEW72AAG  Hague Quality Waters  0.99023304
0013s00000vEW72AAG  Hague Quality Waters  0013s00000vEVuwAAG  Hague Quality Water   0.99023304

If you look closely, the SFDC_ID of row 1 is the same as right_SFDC_ID of row 2.

How would I drop the second row using pandas?

  • I'd suggest you format your data a bit better, because at the moment one can't tell if Hague Quality Waters is a column on its own or combined with 0013... Commented Feb 27, 2020 at 22:54
  • Format it better in Stackoverflow? I believe I updated this Commented Feb 27, 2020 at 23:03

2 Answers

Here's one way:

# Elementwise string comparison: True where SFDC_ID sorts before right_SFDC_ID
mask = df['SFDC_ID'] < df['right_SFDC_ID']

# col1 holds the smaller of the two IDs in each row:
# where mask is True keep df['SFDC_ID'], otherwise take df['right_SFDC_ID']
df['col1'] = df['SFDC_ID'].where(mask, df['right_SFDC_ID'])

# col2 holds the larger of the two IDs (same logic with the columns swapped)
df['col2'] = df['right_SFDC_ID'].where(mask, df['SFDC_ID'])

# Mirrored pairs now share the same (col1, col2); keep only the first occurrence
df = df.drop_duplicates(subset=['col1', 'col2'])
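
To see what each step produces, here is a small self-contained sketch that rebuilds the two rows from the question and applies the same idea. The names `key` and `deduped` are mine, and it filters with `duplicated()` instead of adding helper columns to `df`; treat it as an illustration under those assumptions rather than the only way to do it.

```python
import pandas as pd

# The two mirrored rows from the question
df = pd.DataFrame({
    "SFDC_ID": ["0013s00000vEVuwAAG", "0013s00000vEW72AAG"],
    "left_side": ["Hague Quality Water", "Hague Quality Waters"],
    "right_SFDC_ID": ["0013s00000vEW72AAG", "0013s00000vEVuwAAG"],
    "right_side": ["Hague Quality Waters", "Hague Quality Water"],
    "similairity": [0.99023304, 0.99023304],
})

# True where SFDC_ID sorts before right_SFDC_ID (plain string comparison)
mask = df["SFDC_ID"] < df["right_SFDC_ID"]

# Order-independent key: smaller ID first, larger ID second
key = pd.DataFrame({
    "col1": df["SFDC_ID"].where(mask, df["right_SFDC_ID"]),
    "col2": df["right_SFDC_ID"].where(mask, df["SFDC_ID"]),
})

# Both rows get the same key, so the second one is flagged as a duplicate
deduped = df[~key.duplicated()]
print(deduped)
```

Because the key is kept in a separate frame, `deduped` has exactly the original five columns.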

Comments

Could you explain what exactly this does?
matches_df['SFDC_ID'] < matches_df['right_SFDC_ID'] What exactly does this do?
Did you find my commented answer helpful?
It's not working for me. I am still seeing the same issue.
Based on the example you've given, I understand that you want the second row removed because its 'SFDC_ID' and 'right_SFDC_ID' are the same pair as in the first row, just swapped. The lines I've put above do exactly that: the code checks whether those two columns hold the same pair of values (in any order) and keeps only the first occurrence.

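You could iterate over the rows and drop each row whose right_SFDC_ID matches the previous row's SFDC_ID. Collect the labels first and drop once at the end, because dropping in place inside the loop shifts the positions that iloc relies on:

to_drop = []
for pos in range(1, len(df)):
    prev_SFDC_ID = df.iloc[pos - 1]['SFDC_ID']  # SFDC_ID of the previous row
    if df.iloc[pos]['right_SFDC_ID'] == prev_SFDC_ID:
        to_drop.append(df.index[pos])  # remember the row label
df = df.drop(index=to_drop)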
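
The loop only compares each row with the one directly above it, so mirrored pairs that are not adjacent would slip through. A vectorized sketch that handles any row order, assuming NumPy is available, sorts the two IDs within each row and then flags repeats. The IDs "A" to "D" below are made-up placeholders, and `pairs`, `is_mirror`, and `result` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up IDs: row 2 mirrors row 0, with an unrelated row in between
df = pd.DataFrame({
    "SFDC_ID": ["A", "C", "B"],
    "right_SFDC_ID": ["B", "D", "A"],
})

# Sort the two IDs within each row so (A, B) and (B, A) become identical
pairs = np.sort(df[["SFDC_ID", "right_SFDC_ID"]].to_numpy(), axis=1)

# Flag every repeat of a pair after its first occurrence
is_mirror = pd.DataFrame(pairs, index=df.index).duplicated()

# Keep only the first occurrence of each unordered pair
result = df[~is_mirror]
print(result)
```

This avoids the Python-level loop entirely, which tends to matter on larger frames.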
