Compare Pandas dataframes and add column

Question

I have two dataframe as below

df1     df2 
A       A   C
A1      A1  C1
A2      A2  C2
A3      A3  C3
A1      A4  C4
A2          
A3          
A4

The values of column 'A' are defined in df2 in column 'C'. I want to add a new column to df1 with column B with its value from df2 column 'C'

The final df1 should look like this

df1
A   B
A1  C1
A2  C2
A3  C3
A1  C1
A2  C2
A3  C3
A4  C4

I can loop over df2 and add the value to df1 but its time consuming as the data is huge.

    for index, row in df2.iterrows():
           df1.loc[df1.A.isin([row['A']]), 'B']= row['C']

Can someone help me to understand how can I solve this without looping over df2.

Thanks

jezrael · Accepted Answer · 2016-07-27 09:23:00Z

You can use map by Series:

df1['B'] = df1.A.map(df2.set_index('A')['C'])
print (df1)
    A   B
0  A1  C1
1  A2  C2
2  A3  C3
3  A1  C1
4  A2  C2
5  A3  C3
6  A4  C4

It is same as map by dict:

d = df2.set_index('A')['C'].to_dict()
print (d)
{'A4': 'C4', 'A3': 'C3', 'A2': 'C2', 'A1': 'C1'}

df1['B'] = df1.A.map(d)
print (df1)
    A   B
0  A1  C1
1  A2  C2
2  A3  C3
3  A1  C1
4  A2  C2
5  A3  C3
6  A4  C4

Timings:

len(df1)=7:

In [161]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
1000 loops, best of 3: 1.73 ms per loop

In [162]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
The slowest run took 4.44 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 873 µs per loop

len(df1)=70k:

In [164]: %timeit merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
100 loops, best of 3: 12.8 ms per loop

In [165]: %timeit df1['B'] = df1.A.map(df2.set_index('A')['C'])
100 loops, best of 3: 6.05 ms per loop

EdChum · Accepted Answer · 2016-07-27 09:14:54Z

2

IIUC you can just merge and rename the col

df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})

In [103]:
df1 = pd.DataFrame({'A':['A1','A2','A3','A1','A2','A3','A4']})
df2 = pd.DataFrame({'A':['A1','A2','A3','A4'], 'C':['C1','C2','C4','C4']})
merged = df1.merge(df2, on='A', how='left').rename(columns={'C':'B'})
merged

Out[103]:
    A   B
0  A1  C1
1  A2  C2
2  A3  C4
3  A1  C1
4  A2  C2
5  A3  C4
6  A4  C4

answered Jul 27, 2016 at 9:14

EdChum

397k204 gold badges836 silver badges583 bronze badges

2 Comments

Vinay Gudisagar Over a year ago

Thanks everyone for your suggestion. I used this solution as it would merge other columns in df2 to df1. Thanks @EdChum

EdChum Over a year ago

There is also semantic difference between merge and map in that if the lookup in df1 doesn't exist in df2 then merge will insert NaN whilst map will throw a KeyError

Divakar · Accepted Answer · 2016-07-27 09:53:30Z

2

Based on searchsorted method, here are three approaches with different indexing schemes -

df1['B'] = df2.C[df2.A.searchsorted(df1.A)].values
df1['B'] = df2.C[df2.A.searchsorted(df1.A)].reset_index(drop=True)
df1['B'] = df2.C.values[df2.A.searchsorted(df1.A)]

answered Jul 27, 2016 at 9:53

Divakar

222k19 gold badges273 silver badges374 bronze badges

Collectives™ on Stack Overflow

Compare Pandas dataframes and add column

3 Answers 3

Comments

2 Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related