Pandas: Find rows which don't exist in another DataFrame by multiple columns

Question

This is the same as the question "python pandas: how to find rows in one dataframe but not in another?", but with multiple columns.

This is the setup:

import pandas as pd

df = pd.DataFrame(dict(
    col1=[0,1,1,2],
    col2=['a','b','c','b'],
    extra_col=['this','is','just','something']
))

other = pd.DataFrame(dict(
    col1=[1,2],
    col2=['b','c']
))

Now I want to select the rows from df which don't exist in other, matching on col1 and col2.

In SQL I would do:

select * from df 
where not exists (
    select * from other o 
    where df.col1 = o.col1 and 
    df.col2 = o.col2
)

In Pandas I can do something like this, but it feels very ugly. Part of the ugliness could be avoided if df had an id column, but one isn't always available.

key_col = ['col1','col2']
df_with_idx = df.reset_index()
common = pd.merge(df_with_idx,other,on=key_col)['index']
mask = df_with_idx['index'].isin(common)

desired_result = df_with_idx[~mask].drop('index', axis=1)

Is there a more elegant way?

EdChum · Accepted Answer · 2016-02-12 19:25:05Z

54

Since pandas 0.17.0, merge accepts an indicator param. It adds a _merge column that tells you whether each row is present only in the left frame, only in the right frame, or in both:

In [5]:
merged = df.merge(other, how='left', indicator=True)
merged

Out[5]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
1     1    b         is       both
2     1    c       just  left_only
3     2    b  something  left_only

In [6]:    
merged[merged['_merge']=='left_only']

Out[6]:
   col1 col2  extra_col     _merge
0     0    a       this  left_only
2     1    c       just  left_only
3     2    b  something  left_only

So you can filter the merged df by selecting only the 'left_only' rows.
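Two details are worth keeping in mind when reusing the merge above. Without an on argument, merge joins on every column name the two frames share, and duplicate key rows in other would repeat the matching rows of df. The following is a minimal sketch that guards against both, assuming a hypothetical helper named anti_join (it is not part of pandas) and the df and other frames from the question:

```python
import pandas as pd

def anti_join(left, right, on):
    """Return the rows of `left` whose `on` values have no match in `right`.

    Hypothetical helper for illustration; not a pandas function.
    """
    # Keep only the key columns and drop duplicate keys, so the merge
    # cannot pull in extra columns from `right` or repeat rows of `left`.
    keys = right[on].drop_duplicates()
    merged = left.merge(keys, on=on, how='left', indicator=True)
    # Keep rows found only in `left`, then drop the helper column.
    return merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge')

df = pd.DataFrame(dict(
    col1=[0, 1, 1, 2],
    col2=['a', 'b', 'c', 'b'],
    extra_col=['this', 'is', 'just', 'something']
))

other = pd.DataFrame(dict(
    col1=[1, 2],
    col2=['b', 'c']
))

result = anti_join(df, other, ['col1', 'col2'])
print(result)
```

Note that merge builds a fresh integer index for its result, so the row labels of the output are positions in the merged frame rather than the original labels of df. They happen to coincide here because df has a default index and no row is repeated.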


4 Comments

Pekka Over a year ago

Thanks for coming back to this. You could do this in one line with df.merge(other, how='left', indicator=True).query('_merge == "left_only"') but don't know if that's any better.

EdChum Over a year ago

Personally, I find that too much chaining for the sake of producing a one-liner can make the code more difficult to read, though there may be some speed and memory improvements.

SpeedCoder5 Over a year ago

@Pekka: In addition, to get back to the original left columns in one line: df.merge(other, how='left', indicator=True).query('_merge == "left_only"').drop(['_merge'],axis=1)

Nirmal Over a year ago

I want to add that merged[merged['_merge']=='left_only'] did not show the desired result for me until I wrote the data to a file and read it back. I had to write to a .csv and read it into a new pandas df to see the left_only rows.

greg_data · Accepted Answer · 2015-09-18 13:43:16Z

6

Interesting

cols = ['col1','col2']
# get copies where the indices are the columns of interest
df2 = df.set_index(cols)
other2 = other.set_index(cols)
# look for index overlap, then negate the mask with ~
df[~df2.index.isin(other2.index)]

Returns:

   col1 col2  extra_col
0     0    a       this
2     1    c       just
3     2    b  something

Seems a little bit more elegant...


1 Comment

EdChum Over a year ago

If you set the index to those cols you can use difference to achieve the same result
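One possible reading of the comment above, sketched with Index.difference on the frames from the question. The variable names only_in_df and result are illustrative, and this assumes the key columns contain no duplicate combinations that need special handling:

```python
import pandas as pd

df = pd.DataFrame(dict(
    col1=[0, 1, 1, 2],
    col2=['a', 'b', 'c', 'b'],
    extra_col=['this', 'is', 'just', 'something']
))

other = pd.DataFrame(dict(
    col1=[1, 2],
    col2=['b', 'c']
))

cols = ['col1', 'col2']
df2 = df.set_index(cols)
other2 = other.set_index(cols)

# Index.difference returns the keys that are in df2's index but not in
# other2's index, as a de-duplicated index.
only_in_df = df2.index.difference(other2.index)

# Select those keys, then move the key columns back out of the index.
result = df2.loc[only_in_df].reset_index()
print(result)
```

Unlike the isin mask, this route goes through the index, so the original row labels of df are replaced by a fresh integer index after reset_index.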
