1

I am working with two dataframes that are very similar, and I'm trying to figure out how to get the data that is in one and not the other - and visa versa.

Here is my code so far:

import pandas as pd
import numpy as np


def report_diff(x):
return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

old = pd.read_excel('File 1')
new = pd.read_excel('File 2')
old['version'] = 'old'
new['version'] = 'new'

full_set = pd.concat([old,new],ignore_index=True)

changes = full_set.drop_duplicates(subset=['ID','Type', 'Total'], keep='last')

duplicated = changes.duplicated(subset=['ID', 'Type'], keep=False)

dupe_accts = changes[duplicated]

change_new = dupe_accts[(dupe_accts['version'] == 'new')]

change_old = dupe_accts[(dupe_accts['version'] == 'old' )]

change_new = change_new.drop(['version'], axis=1)

change_old = change_old.drop(['version'],axis=1)

change_new.set_index('Employee ID', inplace=True)

change_old.set_index('Employee ID', inplace=True)

diff_panel = pd.Panel(dict(df1=change_old,df2=change_new))
diff_output = diff_panel.apply(report_diff, axis=0)

And so the next step would be to get the data that is only in old and only in new.

My first attempt was:

 changes['duplicate']=changes['Employee ID'].isin(dupe_accts)
removed_accounts = changes[(changes['duplicate'] == False) & (changes['version'] =='old')]
2

1 Answer 1

4

I got dizzy looking at your code!

IIUC:

Use the parameter indicator=True within a merge

Consider the dataframes old and new

old = pd.DataFrame(dict(
        ID=[1, 2, 3, 4, 5],
        Type=list('AAABB'),
        Total=[9 for _ in range(5)],
        ArbitraryColumn=['blah' for _ in range(5)]
    ))

new = old.head(2)

Then merge and query the left_only

old.merge(
    new, 'outer', on=['ID', 'Type'],
    suffixes=['', '_'], indicator=True
).query('_merge == "left_only"')

  ArbitraryColumn  ID  Total Type ArbitraryColumn_  Total_     _merge
2            blah   3      9    A              NaN     NaN  left_only
3            blah   4      9    B              NaN     NaN  left_only
4            blah   5      9    B              NaN     NaN  left_only

We can reindex to restrict to the original columns

old.merge(
    new, 'outer', on=['ID', 'Type'],
    suffixes=['', '_'], indicator=True
).query('_merge == "left_only"').reindex_axis(old.columns, axis=1)

  ArbitraryColumn  ID  Total Type
2            blah   3      9    A
3            blah   4      9    B
4            blah   5      9    B
Sign up to request clarification or add additional context in comments.

7 Comments

I could upvote twice but I already did reading the first sentence:)
@piRSquared Thank you! Very simply solution to what I was looking for.
@A-Za-z btw folks on Meta lack a sense of humor meta.stackexchange.com/q/293438/326787
that's not a bug; it's a feature :D
@piRSquared What would be the way to get data that is in the right data frame but no the left? I changed to "righ_only" and I get the two columns I join on, but it's still data only in the left. stackoverflow.com/questions/43454258/…
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.