I have three dataframes, df1, df2, and df3, defined as follows:

df1 = 
   A  B   C
0  1  a  a1
1  2  b  b2
2  3  c  c3
3  4  d  d4
4  5  e  e5
5  6  f  f6

df2 = 
   A  B  C
0  1  a  X
1  2  b  Y
2  3  c  Z

df3 =
   A  B  C
3  4  d  P
4  5  e  Q
5  6  f  R

I have defined a Primary Key list PK = ["A","B"].

Next, I create a fourth dataframe with df4 = df1.sample(n=2), which gives something like:

df4 = 
   A  B   C
4  5  e  e5
1  2  b  b2
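
For reproducibility, the frames above can be built roughly as follows. This is a minimal sketch: the explicit index on df3 and the fixed row selection for df4 are assumptions made to match the printed output, since df1.sample(n=2) returns a different pair on each run.

```python
import pandas as pd

df1 = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": list("abcdef"),
    "C": ["a1", "b2", "c3", "d4", "e5", "f6"],
})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": list("abc"), "C": list("XYZ")})
# df3 is shown with index 3..5, so set it explicitly.
df3 = pd.DataFrame(
    {"A": [4, 5, 6], "B": list("def"), "C": list("PQR")},
    index=[3, 4, 5],
)

PK = ["A", "B"]

# Assumption: select rows 4 and 1 by label instead of sampling,
# so the example matches the df4 printed above.
df4 = df1.loc[[4, 1]]
```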

Now, I want to select the rows from df2 and df3 that match the primary-key values of df4. For example, in this case I need the row with index = 4 from df3 and the row with index = 1 from df2.

If possible, I would like to get a dataframe like this:

df =
   A  B   C  A(df2)  B(df2) C(df2)  A(df3)  B(df3)  C(df3)
4  5  e  e5                         5       e       Q
1  2  b  b2  2       b      Y

Any ideas on how to work this out will be very helpful.

2 Answers

Use two consecutive DataFrame.merge calls to left-merge df2 and df3 onto df4, applying DataFrame.add_suffix to the right-hand dataframe each time so its columns get distinct names. Finally, use DataFrame.fillna to replace the missing values with empty strings:

df = (
    df4.merge(df2.add_suffix('(df2)'), left_on=['A', 'B'], right_on=['A(df2)', 'B(df2)'], how='left')
    .merge(df3.add_suffix('(df3)'), left_on=['A', 'B'], right_on=['A(df3)', 'B(df3)'], how='left')
    .fillna('')
)

Result:

# print(df)

   A  B   C A(df2)  B(df2) C(df2) A(df3) B(df3) C(df3)
0  5  e  e5                           5      e      Q
1  2  b  b2      2      b      Y                    
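
The same idea can be sketched more generally, assuming the key list PK from the question and an unnamed index on df4. The helper name attach_matches is made up for illustration. It loops over any number of frames and restores df4's original index (4 and 1), which a plain merge replaces with 0 and 1. The sample data is rebuilt here only to keep the snippet self-contained.

```python
import pandas as pd

PK = ["A", "B"]
df1 = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": list("abcdef"),
    "C": ["a1", "b2", "c3", "d4", "e5", "f6"],
})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": list("abc"), "C": list("XYZ")})
df3 = pd.DataFrame({"A": [4, 5, 6], "B": list("def"), "C": list("PQR")},
                   index=[3, 4, 5])
df4 = df1.loc[[4, 1]]  # fixed stand-in for df1.sample(n=2)


def attach_matches(base, others, keys):
    """Left-merge each frame in `others` onto `base`, suffixing its columns."""
    # Keep the original index as a column named "index" so it survives the merges.
    out = base.reset_index()
    for name, frame in others.items():
        right = frame.add_suffix(f"({name})")
        out = out.merge(
            right,
            left_on=keys,
            right_on=[f"{k}({name})" for k in keys],
            how="left",
        )
    # Put the original index back and drop its "index" label.
    return out.set_index("index").rename_axis(None)


result = attach_matches(df4, {"df2": df2, "df3": df3}, PK)
print(result.fillna(""))
```

This assumes the key combinations are unique within df2 and df3. Duplicate keys would produce more than one output row per df4 row.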

Here's how I would do it on the entire data set. If you want to sample first, either replace df1 with df4 in the merge statements at the end, or just take a sample of t.

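import pandas as pd

PK = ["A","B"]

# Place two copies of df2 side by side: the first keeps the original key
# names (used for the merge), the second carries the (df2) suffix.
df2 = pd.concat([df2,df2], axis=1)
df2.columns=['A','B','C','A(df2)', 'B(df2)', 'C(df2)']
# Drop the unsuffixed C so it does not clash with df1's C.
df2.drop(columns=['C'], inplace=True)

# Same preparation for df3.
df3 = pd.concat([df3,df3], axis=1)
df3.columns=['A','B','C','A(df3)', 'B(df3)', 'C(df3)']
df3.drop(columns=['C'], inplace=True)

# Left-merge both prepared frames onto df1 using the primary keys.
t = df1.merge(df2, on=PK, how='left')
t = t.merge(df3, on=PK, how='left')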

Output

    A   B   C   A(df2)  B(df2)  C(df2)  A(df3)  B(df3)  C(df3)
0   1   a   a1  1.0     a       X       NaN     NaN     NaN
1   2   b   b2  2.0     b       Y       NaN     NaN     NaN
2   3   c   c3  3.0     c       Z       NaN     NaN     NaN
3   4   d   d4  NaN     NaN     NaN     4.0     d       P
4   5   e   e5  NaN     NaN     NaN     5.0     e       Q
5   6   f   f6  NaN     NaN     NaN     6.0     f       R
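
A variant of this approach that leaves df2 and df3 unmodified, and then takes the sample from t as suggested above, might look like the sketch below. The helper name with_suffixed_copy and the random_state value are assumptions added for illustration; the sample data is rebuilt only to keep the snippet self-contained.

```python
import pandas as pd

PK = ["A", "B"]
df1 = pd.DataFrame({
    "A": [1, 2, 3, 4, 5, 6],
    "B": list("abcdef"),
    "C": ["a1", "b2", "c3", "d4", "e5", "f6"],
})
df2 = pd.DataFrame({"A": [1, 2, 3], "B": list("abc"), "C": list("XYZ")})
df3 = pd.DataFrame({"A": [4, 5, 6], "B": list("def"), "C": list("PQR")},
                   index=[3, 4, 5])


def with_suffixed_copy(frame, label, keys):
    """Return the unsuffixed key columns next to a fully suffixed copy of `frame`."""
    suffixed = frame.add_suffix(f"({label})")
    return pd.concat([frame[keys], suffixed], axis=1)


# Merge on the unsuffixed keys; the suffixed copies ride along as extra columns.
t = (
    df1.merge(with_suffixed_copy(df2, "df2", PK), on=PK, how="left")
       .merge(with_suffixed_copy(df3, "df3", PK), on=PK, how="left")
)

# Sample after merging; random_state only makes the draw repeatable.
sampled = t.sample(n=2, random_state=0)
print(sampled)
```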
