Python How to get values that are in one dataframe but not a second

Question

I am working with two dataframes that are very similar, and I'm trying to figure out how to get the data that is in one and not the other - and visa versa.

Here is my code so far:

import pandas as pd
import numpy as np


def report_diff(x):
return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

old = pd.read_excel('File 1')
new = pd.read_excel('File 2')
old['version'] = 'old'
new['version'] = 'new'

full_set = pd.concat([old,new],ignore_index=True)

changes = full_set.drop_duplicates(subset=['ID','Type', 'Total'], keep='last')

duplicated = changes.duplicated(subset=['ID', 'Type'], keep=False)

dupe_accts = changes[duplicated]

change_new = dupe_accts[(dupe_accts['version'] == 'new')]

change_old = dupe_accts[(dupe_accts['version'] == 'old' )]

change_new = change_new.drop(['version'], axis=1)

change_old = change_old.drop(['version'],axis=1)

change_new.set_index('Employee ID', inplace=True)

change_old.set_index('Employee ID', inplace=True)

diff_panel = pd.Panel(dict(df1=change_old,df2=change_new))
diff_output = diff_panel.apply(report_diff, axis=0)

And so the next step would be to get the data that is only in old and only in new.

My first attempt was:

 changes['duplicate']=changes['Employee ID'].isin(dupe_accts)
removed_accounts = changes[(changes['duplicate'] == False) & (changes['version'] =='old')]

Fix the quote in new = pd.read_excel('File 2).

Mad Physicist
– Mad Physicist

2017-04-06 18:34:53 +00:00
Commented Apr 6, 2017 at 18:34 — Mad Physicist
– Mad Physicist, Commented Apr 6, 2017 at 18:34
stackoverflow.com/questions/20225110/…

Dadep
– Dadep

2017-04-06 18:38:37 +00:00
Commented Apr 6, 2017 at 18:38 — Dadep
– Dadep, Commented Apr 6, 2017 at 18:38

piRSquared · Accepted Answer · 2017-04-06 18:49:12Z

4

I got dizzy looking at your code!

IIUC:

Use the parameter indicator=True within a merge

Consider the dataframes old and new

old = pd.DataFrame(dict(
        ID=[1, 2, 3, 4, 5],
        Type=list('AAABB'),
        Total=[9 for _ in range(5)],
        ArbitraryColumn=['blah' for _ in range(5)]
    ))

new = old.head(2)

Then merge and query the left_only

old.merge(
    new, 'outer', on=['ID', 'Type'],
    suffixes=['', '_'], indicator=True
).query('_merge == "left_only"')

  ArbitraryColumn  ID  Total Type ArbitraryColumn_  Total_     _merge
2            blah   3      9    A              NaN     NaN  left_only
3            blah   4      9    B              NaN     NaN  left_only
4            blah   5      9    B              NaN     NaN  left_only

We can reindex to restrict to the original columns

old.merge(
    new, 'outer', on=['ID', 'Type'],
    suffixes=['', '_'], indicator=True
).query('_merge == "left_only"').reindex_axis(old.columns, axis=1)

  ArbitraryColumn  ID  Total Type
2            blah   3      9    A
3            blah   4      9    B
4            blah   5      9    B

answered Apr 6, 2017 at 18:49

piRSquared

296k68 gold badges509 silver badges654 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

Vaishali Over a year ago

I could upvote twice but I already did reading the first sentence:)

Troy R Over a year ago

@piRSquared Thank you! Very simply solution to what I was looking for.

piRSquared Over a year ago

@A-Za-z btw folks on Meta lack a sense of humor meta.stackexchange.com/q/293438/326787

Vaishali Over a year ago

that's not a bug; it's a feature :D

Troy R Over a year ago

@piRSquared What would be the way to get data that is in the right data frame but no the left? I changed to "righ_only" and I get the two columns I join on, but it's still data only in the left. stackoverflow.com/questions/43454258/…

|

Collectives™ on Stack Overflow

Python How to get values that are in one dataframe but not a second

1 Answer 1

7 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

7 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related