How to compare values of certain columns of one dataframe with the values of same set of columns in another dataframe?

Question

I have three dataframes df1, df2, and df3, which are defined as follows

df1 = 
   A  B   C
0  1  a  a1
1  2  b  b2
2  3  c  c3
3  4  d  d4
4  5  e  e5
5  6  f  f6

df2 = 
   A  B  C
0  1  a  X
1  2  b  Y
2  3  c  Z

df3 =
   A  B  C
3  4  d  P
4  5  e  Q
5  6  f  R

I have defined a Primary Key list PK = ["A","B"].

Now, I take a fourth dataframe df4 as df4 = df1.sample(n=2), which gives something like

df4 = 
   A  B   C
4  5  e  e5
1  2  b  b2

Now I want to select the rows from df2 and df3 that match the primary-key values in df4. For example, in this case I need the row with index 4 from df3 and the row with index 1 from df2.

If possible I need to get a dataframe as follows:

df =
   A  B   C  A(df2)  B(df2) C(df2)  A(df3)  B(df3)  C(df3)
4  5  e  e5                         5       e       Q
1  2  b  b2  2       b      Y

Any ideas on how to work this out will be very helpful.

Shubham Sharma · Accepted Answer · 2020-07-04 12:14:34Z

Use two consecutive DataFrame.merge operations to left-merge df4 with df2 and then with df3, calling DataFrame.add_suffix on each right-hand dataframe so its columns get distinct names. Finally, use DataFrame.fillna to replace the missing values with an empty string:

df = (
    df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
    .merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
    .fillna('')
)

Result:

# print(df)

   A  B   C A(df2)  B(df2) C(df2) A(df3) B(df3) C(df3)
0  5  e  e5                           5      e      Q
1  2  b  b2      2      b      Y
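For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the frames from the question and drives the merge from the PK list instead of hard-coding ['A', 'B']. The helper name merge_on_pk is hypothetical (it is not part of pandas), and df4 is picked with .loc rather than sample so the rows match the example in the question.

```python
import pandas as pd

# Frames as given in the question.
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6],
                    "B": list("abcdef"),
                    "C": ["a1", "b2", "c3", "d4", "e5", "f6"]})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": list("abc"), "C": list("XYZ")})
df3 = pd.DataFrame({"A": [4, 5, 6], "B": list("def"), "C": list("PQR")},
                   index=[3, 4, 5])

PK = ["A", "B"]

# Fixed stand-in for df1.sample(n=2), so the result is reproducible.
df4 = df1.loc[[4, 1]]


def merge_on_pk(left, right, label, keys):
    """Hypothetical helper: left-merge `right` onto `left` by `keys`,
    suffixing every column of `right` with '(label)'."""
    suffixed = right.add_suffix(f"({label})")
    right_keys = [f"{k}({label})" for k in keys]
    return left.merge(suffixed, left_on=keys, right_on=right_keys, how="left")


df = merge_on_pk(df4, df2, "df2", PK)
df = merge_on_pk(df, df3, "df3", PK).fillna("")
print(df)
```

Because unmatched rows introduce NaN before fillna runs, the suffixed key columns may print as floats (for example 2.0) depending on the pandas version. A left merge also resets the index to 0, 1, so the original labels 4 and 1 are not kept.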

edited Jul 4, 2020 at 12:14

answered Jul 4, 2020 at 12:01

Chris · Accepted Answer · 2020-07-04 11:55:46Z

Here's how I would do it on the entire data set. If you want to sample first, update the merge statements at the end by replacing df1 with df4, or just take a sample of t:

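import pandas as pd

PK = ["A","B"]

# Duplicate df2's columns side by side: the first A and B stay as merge keys,
# the second copy gets the (df2) suffix. The unsuffixed C is dropped.
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
df2.drop(columns=['C'], inplace=True)

# Same preparation for df3.
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)

# Left-merge both lookup frames onto df1 by the primary key.
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')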

Output

    A   B   C   A(df2)  B(df2)  C(df2)  A(df3)  B(df3)  C(df3)
0   1   a   a1  1.0     a       X       NaN     NaN     NaN
1   2   b   b2  2.0     b       Y       NaN     NaN     NaN
2   3   c   c3  3.0     c       Z       NaN     NaN     NaN
3   4   d   d4  NaN     NaN     NaN     4.0     d       P
4   5   e   e5  NaN     NaN     NaN     5.0     e       Q
5   6   f   f6  NaN     NaN     NaN     6.0     f       R
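The "sample first" variant mentioned above could look like the sketch below. It is an assumption-laden rewrite, not the answer's exact code: the helper with_suffixed_copy is a hypothetical name that does the same job as the concat/rename/drop steps without overwriting df2 and df3, and random_state=0 is added only to make the sample repeatable.

```python
import pandas as pd

# Frames as given in the question.
df1 = pd.DataFrame({"A": [1, 2, 3, 4, 5, 6],
                    "B": list("abcdef"),
                    "C": ["a1", "b2", "c3", "d4", "e5", "f6"]})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": list("abc"), "C": list("XYZ")})
df3 = pd.DataFrame({"A": [4, 5, 6], "B": list("def"), "C": list("PQR")},
                   index=[3, 4, 5])

PK = ["A", "B"]


def with_suffixed_copy(frame, label, keys):
    """Hypothetical helper: keep the key columns under their original names
    and append a copy of every column suffixed with '(label)'."""
    return pd.concat([frame[keys], frame.add_suffix(f"({label})")], axis=1)


# Sample first, then merge the sample instead of the full df1.
df4 = df1.sample(n=2, random_state=0)

t = (
    df4.merge(with_suffixed_copy(df2, "df2", PK), on=PK, how="left")
       .merge(with_suffixed_copy(df3, "df3", PK), on=PK, how="left")
)
print(t)
```

Since every key in df1 appears in exactly one of df2 and df3, each sampled row ends up with values in one suffixed group and NaN in the other.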
