Python Pandas: how to find rows in one dataframe but not in another?

Question

Let's say that I have two tables: people_all and people_usa, both with the same structure and therefore the same primary key.

How can I get a table of the people not in the USA? In SQL I'd do something like:

select a.*
from people_all a

left outer join people_usa u
on a.id = u.id

where u.id is null

What would be the Python equivalent? I cannot think of a way to translate this where statement into Pandas syntax.

The only way I can think of is to add an arbitrary field to people_usa (e.g. people_usa['dummy']=1), do a left join, then take only the records where 'dummy' is nan, then delete the dummy field - which seems a bit convoluted.

wouldnt a people_all_set.difference(people_usa_set) do the trick? — Lawrence Benson
– Lawrence Benson, Commented Sep 18, 2015 at 12:19
Does this work only on the index of the dataframe? I'd like the option to specify the field(s) to apply this to — Pythonista anonymous
– Pythonista anonymous, Commented Sep 18, 2015 at 12:22
@LawrenceBenson difference operates on indexes so it would need to be people_all_set.index.difference(people_usa_set.index) pandas.pydata.org/docs/reference/api/… — philipnye
– philipnye, Commented Jan 28, 2023 at 15:01

EdChum · Accepted Answer · 2015-09-18 12:22:21Z

29

use isin and negate the boolean mask:

people_usa[~people_usa['ID'].isin(people_all ['ID'])]

Example:

In [364]:
people_all = pd.DataFrame({ 'ID' : np.arange(5)})
people_usa = pd.DataFrame({ 'ID' : [3,4,6,7,100]})
people_usa[~people_usa['ID'].isin(people_all['ID'])]

Out[364]:
    ID
2    6
3    7
4  100

so 3 and 4 are removed from the result, the boolean mask looks like this:

In [366]:
people_usa['ID'].isin(people_all['ID'])

Out[366]:
0     True
1     True
2    False
3    False
4    False
Name: ID, dtype: bool

using ~ inverts the mask

edited Sep 18, 2015 at 12:22

answered Sep 18, 2015 at 12:20

EdChum

397k204 gold badges836 silver badges583 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Pekka Over a year ago

Is there any easy way to do this if you have multiple columns to check/join?

EdChum Over a year ago

You could do a merge and then eliminate the rows that exist in the merged df otherwise you'd have to build a boolean condition for all the columns you want to compare but presumably when checking the multiple columns you're stating that it's unique for those columns, correct? For instance it's not a match if say col1 and col2 match but col3 does not

Pekka Over a year ago

Yes merge is what I have been doing but it feels like a hassle. ...I mean something like select * from A where not exists (select * from B where A.col1 = B.col1 and A.col2 = B.col2) I feel like this statement is impossible to do elegantly in pandas :(

unutbu Over a year ago

@Pekka: You could use mask = people_all[primary_key].isin(people_usa[primary_key]).all(axis=1). Then select the nonusa people with people_nonusa = people_all.loc[~mask].

EdChum Over a year ago

@Pekka I agree with unutbu in that you don't necessarily have to do everything in a one-liner and can split the statement to make it more readable

|

MaxU - stand with Ukraine · Accepted Answer · 2016-12-24 22:24:07Z

7

Here is another similar to SQL Pandas method: .query():

people_all.query('ID not in @people_usa.ID')

or using NumPy's in1d() method:

people_all.[~np.in1d(people_all, people_usa)]

NOTE: for those who have experience with SQL it might be worth to read Pandas comparison with SQL

edited Dec 24, 2016 at 22:24

answered Dec 24, 2016 at 22:15

MaxU - stand with Ukraine

212k37 gold badges402 silver badges436 bronze badges

1 Comment

princess_hacker Over a year ago

Your second method returns errors for me, unfortunately. For one, the syntax is wrong since it has dot notation and square brackets. Removing the dot, it complains that the item has a wrong length.

Graham Streich · Accepted Answer · 2017-06-22 18:13:51Z

-2

I would combine (by stacking) the data frames and then perform a .drop_duplicates method. Documentation found here:

http://pandas.pydata.org/pandas-docs/version/0.17.1/generated/pandas.DataFrame.drop_duplicates.html

answered Jun 22, 2017 at 18:13

Graham Streich

9243 gold badges16 silver badges33 bronze badges

Collectives™ on Stack Overflow

Python Pandas: how to find rows in one dataframe but not in another?

3 Answers 3

6 Comments

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

6 Comments

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related