
This is the same as "python pandas: how to find rows in one dataframe but not in another?", but with multiple columns.

This is the setup:

import pandas as pd

df = pd.DataFrame(dict(
    col1=[0,1,1,2],
    col2=['a','b','c','b'],
    extra_col=['this','is','just','something']
))

other = pd.DataFrame(dict(
    col1=[1,2],
    col2=['b','c']
))

Now I want to select the rows from df that don't exist in other. The selection should be made by col1 and col2.

In SQL I would do:

select * from df 
where not exists (
    select * from other o 
    where df.col1 = o.col1 and 
    df.col2 = o.col2
)

In pandas I can do something like the following, but it feels very ugly. Part of the ugliness could be avoided if df had an id column, but one is not always available.

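key_col = ['col1', 'col2']
# Expose the row labels as an 'index' column so matched rows can be identified
df_with_idx = df.reset_index()
# Row labels of df that have a matching (col1, col2) pair in other
common = pd.merge(df_with_idx, other, on=key_col)['index']
mask = df_with_idx['index'].isin(common)

# Keep the unmatched rows and drop the helper column
desired_result = df_with_idx[~mask].drop('index', axis=1)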

Is there a more elegant way?

2 Answers

54

Since pandas 0.17.0 there is an indicator param you can pass to merge, which tells you whether each row is present only in left, only in right, or in both:

In [5]:
merged = df.merge(other, how='left', indicator=True)
merged

Out[5]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
1     1    b         is       both
2     1    c       just  left_only
3     2    b  something  left_only

In [6]:    
merged[merged['_merge']=='left_only']

Out[6]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
2     1    c       just  left_only
3     2    b  something  left_only

So you can now filter the merged df by selecting only the 'left_only' rows.
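For reuse, the indicator approach could be wrapped in a small helper. This is only a sketch: the name anti_join and its parameters are made up for illustration and are not part of pandas. It passes the key columns explicitly and de-duplicates the keys of the right frame first, on the assumption that repeated keys in other should not multiply rows of df in the left merge.

```python
import pandas as pd


def anti_join(left, right, on):
    """Return the rows of `left` whose `on` values have no match in `right`.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Keep only the key columns and drop duplicate keys so that the
    # left merge cannot produce more than one output row per left row.
    keys = right[on].drop_duplicates()
    merged = left.merge(keys, on=on, how="left", indicator=True)
    # Rows tagged 'left_only' had no partner in `right`.
    unmatched = merged.loc[merged["_merge"] == "left_only"]
    return unmatched.drop(columns="_merge")


df = pd.DataFrame(dict(
    col1=[0, 1, 1, 2],
    col2=["a", "b", "c", "b"],
    extra_col=["this", "is", "just", "something"],
))
other = pd.DataFrame(dict(col1=[1, 2], col2=["b", "c"]))

result = anti_join(df, other, ["col1", "col2"])
print(result)
```

Note that merge builds a fresh index for its result, so the row labels in the output only line up with those of df when df has the default 0..n-1 index, as it does here.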


4 Comments

Thanks for coming back to this. You could do this in one line with df.merge(other, how='left', indicator=True).query('_merge == "left_only"'), but I don't know if that's any better.
Personally, I find that too much chaining for the sake of producing a one-liner can make the code more difficult to read. There may be some speed and memory improvements, though.
@Pekka: plus, to get back to the original left in one line: df.merge(other, how='left', indicator=True).query('_merge == "left_only"').drop(['_merge'],axis=1)
I want to add that merged[merged['_merge']=='left_only'] did not show the desired result for me until I wrote the frame to a file and read it back. I had to write to a .csv and read it into a new pandas df to see the left_only rows.
6

Interesting

cols = ['col1', 'col2']
# Get copies whose index is built from the columns of interest
df2 = df.set_index(cols)
other2 = other.set_index(cols)
# Find the keys of df that also occur in other, then negate the mask with ~
df[~df2.index.isin(other2.index)]

Returns:

    col1 col2  extra_col
0     0    a       this
2     1    c       just
3     2    b  something

Seems a little bit more elegant...

1 Comment

If you set the index to those cols, you can use difference to achieve the same result.
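A minimal sketch of what that comment might look like in practice, assuming "difference" refers to Index.difference on the key index. Two caveats apply: difference returns the keys unique and sorted, so the output is ordered by key rather than by original position, and the original row labels are lost when the keys are turned back into columns.

```python
import pandas as pd

df = pd.DataFrame(dict(
    col1=[0, 1, 1, 2],
    col2=["a", "b", "c", "b"],
    extra_col=["this", "is", "just", "something"],
))
other = pd.DataFrame(dict(col1=[1, 2], col2=["b", "c"]))

cols = ["col1", "col2"]
df2 = df.set_index(cols)
other2 = other.set_index(cols)

# Keys that occur in df but not in other (unique and sorted).
missing = df2.index.difference(other2.index)

# Select the rows carrying those keys, then turn the keys back into columns.
result = df2.loc[missing].reset_index()
print(result)
```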
