New column based on matching values from another dataframe pandas

Question

Given two dataframes such as df1 and df2 in the example below, how do we merge them to generate df3?

import pandas as pd
import numpy as np

data1 = [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])]
df1 = pd.DataFrame(data1, columns=["column1", "column2"])
print(df1)

data2 = [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])]
df2 = pd.DataFrame(data2, columns=["column3", "column4"])
print(df2)

# Desired result: column5 is the union of the column4 lists for every key in column2
data3 = [("a1", ["A", "B"], ["1", "2", "3", "4"]),
         ("a2", ["A", "B", "C"], ["1", "2", "3", "4", "5"]),
         ("a3", ["B", "C"], ["1", "3", "4", "5"])]
df3 = pd.DataFrame(data3, columns=["column1", "column2", "column5"])
print(df3)

I would like to avoid for loops, since I am dealing with large datasets.

BENY · Accepted Answer · 2019-03-13 14:19:56Z

7

Rebuild df1's list column as a DataFrame and stack it, then map the values from df2.

Since you asked to avoid a for loop, I am using sum here. Note that sum in this case is much slower than a *for loop* or itertools.

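s = pd.DataFrame(df1.column2.tolist()).stack()
# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent
df1['New'] = s.map(df2.set_index('column3').column4).groupby(level=0).sum().apply(set)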
df1
Out[36]: 
  column1    column2              New
0      a1     [A, B]     {2, 4, 3, 1}
1      a2  [A, B, C]  {3, 5, 4, 2, 1}
2      a3     [B, C]     {4, 3, 1, 5}
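The same stack-then-map idea can also be written with `Series.explode`, which is available from pandas 0.25 onward. The following is only a sketch of that alternative, not the answer's original method; the intermediate names `keys`, `lookup` and `values` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])],
    columns=["column1", "column2"],
)
df2 = pd.DataFrame(
    [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])],
    columns=["column3", "column4"],
)

# One row per (original row, key); the original index label is repeated.
keys = df1["column2"].explode()

# Look up each key's list in df2, then flatten to one row per value.
lookup = df2.set_index("column3")["column4"]
values = keys.map(lookup).explode()

# Collapse back to one set per original row; assignment aligns on the index.
df1["New"] = values.groupby(level=0).agg(set)
print(df1)
```

This avoids concatenating lists with sum, but it still creates one intermediate row per value, so it is worth timing against the plain loop on real data.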

As I mentioned, and as most of us suggested, a plain loop is a good fit here. See also "For loops with pandas - When should I care?"

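import itertools

# Map each column3 key to its list of values
d = dict(zip(df2.column3, df2.column4))

# For each row, chain the lists of all its keys and de-duplicate with set
l = [set(itertools.chain(*[d[y] for y in x])) for x in df1.column2.tolist()]
df1['New'] = l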
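The df3 in the question holds sorted lists rather than sets. A self-contained sketch of the loop approach that produces that shape could look like this; the column name `column5` comes from the question, and `itertools.chain.from_iterable` is used in place of `chain(*...)` as a small stylistic choice.

```python
import itertools

import pandas as pd

df1 = pd.DataFrame(
    [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])],
    columns=["column1", "column2"],
)
df2 = pd.DataFrame(
    [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])],
    columns=["column3", "column4"],
)

# Plain dict lookup: column3 key -> list of column4 values.
d = dict(zip(df2["column3"], df2["column4"]))

# For each row: chain the lists of its keys, de-duplicate, then sort.
df1["column5"] = [
    sorted(set(itertools.chain.from_iterable(d[key] for key in keys)))
    for keys in df1["column2"]
]
print(df1)
```

Because the values are strings, `sorted` orders them lexicographically ("10" would come before "2"); convert to int first if numeric order matters.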

edited Mar 13, 2019 at 14:19

answered Mar 13, 2019 at 13:41


4 Comments

Xenobiologist Over a year ago

You have to remove the duplicates, don't you?

It_is_Chris Over a year ago

@Wen-Ben .apply(set) rather than .apply(tuple)?

BENY Over a year ago

@jezrael yep I would say for loop is perfect for this type of question

BENY Over a year ago

@coldspeed glad I can help:-)

Sociopath · Accepted Answer · 2019-03-13 14:00:42Z

2

You can do it as below:

df2_dict = {i:j for i,j in zip(df2['column3'].values, df2['column4'].values)}
# print(df2_dict)

def func(val):
    return sorted(list(set(np.concatenate([df2_dict.get(i) for i in val]))))

df1['column5'] = df1['column2'].apply(func)
print(df1)

Output:

  column1    column2          column5
0      a1     [A, B]     [1, 2, 3, 4]
1      a2  [A, B, C]  [1, 2, 3, 4, 5]
2      a3     [B, C]     [1, 3, 4, 5]
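One caveat with `func` above: `df2_dict.get(i)` returns None for a key that is missing from df2, and `np.concatenate` raises an error for a None entry or for an empty list of inputs. If such rows can occur, a variant along these lines may be more forgiving. It is a sketch only; the helper name `union_of_values` is hypothetical, and the dict is written out by hand so the example runs without pandas.

```python
from itertools import chain


def union_of_values(keys, mapping):
    """Return the sorted, de-duplicated values that `keys` map to."""
    # mapping.get(key, ()) yields an empty tuple for unknown keys,
    # so they contribute nothing instead of raising.
    return sorted(set(chain.from_iterable(mapping.get(key, ()) for key in keys)))


# Same content as df2_dict built from df2 in the answer.
df2_dict = {"A": ["1", "2"], "B": ["1", "3", "4"], "C": ["5"]}

print(union_of_values(["A", "B"], df2_dict))  # ['1', '2', '3', '4']
print(union_of_values(["B", "Z"], df2_dict))  # ['1', '3', '4'] ("Z" is unknown)
print(union_of_values([], df2_dict))          # []
```

It can be applied in the same way as `func`, for example with `df1['column2'].apply(lambda keys: union_of_values(keys, df2_dict))`.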

answered Mar 13, 2019 at 14:00


Rajat Jain · Accepted Answer · 2019-03-13 14:18:09Z

0

This works:

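# Build the lookup Series once instead of calling set_index for every element
lookup = df2.set_index('column3')['column4']
df1['column5'] = df1['column2'].apply(lambda x: list(set(np.concatenate([lookup[i] for i in x]))))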

answered Mar 13, 2019 at 14:18

