Join/Merge two or more pandas dataframes which have 4 columns in common

Question

I know this question might seem repetitive at first, but the truth is that it is not, since I cannot find another answer or similar question whose solution works for me.

I am working with pandas dataframes in Python.

Suppose we have 3 datasets: A, B and C.

B and C are sub-datasets of A (dataframe A was split in two according to a binary column value)
The datasets have different lengths (A is the largest; B and C are smaller since they are sub-datasets of A)
Each dataset has 5 columns: a, b, c, d and e (they have the same column names)
Columns a, b, c and d (taken together) contain no repeated combinations (they could serve as a composite key in a database)
Column e is different for each dataset
Each dataset has a different index from the others (they were NOT generated using pandas.loc; I am dealing with them "already built and taken from outside python")

My question is: how can I put all of these together without losing any rows, pairing them correctly without using the index?

Here's an example:

* Content of A *:
a   b   c   d   e
"x" "y" 0   1   0.99        # 0
"x" "y" 1   1   0.43        # 1
"x" "z" 0   0   0.90        # 2
"y" "z" 0   1   0.11        # 3
"x" "z" 0   1   0.78        # 4

* Content of B *:
a   b   c   d   e
"x" "y" 0   1   0.12        # 0 of dataframe A
"x" "z" 0   0   0.01        # 2 of dataframe A
"y" "z" 0   1   0.45        # 3 of dataframe A

* Content of C *:
a   b   c   d   e
"x" "y" 1   1   0.06        # 1 of dataframe A
"x" "z" 0   0   0.65        # 2 of dataframe A
"x" "z" 0   1   0.20        # 4 of dataframe A

I would like to obtain this output:

* Content of new_df *:
a   b   c   d   e_A   e_B   e_C
"x" "y" 0   1   0.99  0.12  NaN
"x" "y" 1   1   0.43  NaN   0.06
"x" "z" 0   0   0.90  0.01  0.65
"y" "z" 0   1   0.11  0.45  NaN
"x" "z" 0   1   0.78  NaN   0.20

My first attempt was the following, but it dropped the rows that did not have all three values (I need NaN there instead).

new_df1 = pd.merge(A, B, how='left', on=["a", "b", "c", "d"])
new_df2 = pd.merge(new_df1, C, how='left', on=["a", "b", "c", "d"])

How can I achieve my objective of getting a full dataset made of all columns (5) and all rows (A contains the maximum amount of rows)?

reproducible input:

A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})

B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})

C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})

I am sorry, but I don't understand what you mean @mozway. Can you explain what I should add to my question, please?
– hellomynameisA, Commented Apr 15, 2022 at 12:55
Code like dfA = pd.DataFrame(...), so that one can just copy/paste it to have the objects.
– mozway, Commented Apr 15, 2022 at 12:56
I created them here while writing the question, let me try to copy them into Python...
– hellomynameisA, Commented Apr 15, 2022 at 12:57

exilour · Accepted Answer · 2022-04-15 13:05:43Z

2

This seems to work for me.

import pandas as pd

a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')
c = pd.read_csv('c.csv')

e = a.merge(b, how='left', left_on=['a', 'b', 'c', 'd'], right_on=['a','b','c', 'd'], suffixes=['_A', '_B'])
e = e.merge(c, how='left', left_on=['a', 'b', 'c', 'd'], right_on=['a','b','c', 'd'])

e = e.rename(columns={'e': 'e_C'})

print(e.head())

A brief explanation: pd.DataFrame.merge() lets you state the joining columns explicitly (see the docs), so we can merge on a, b, c and d.

The suffixes parameter renames the overlapping columns by appending the given suffix for each side (left_df_suffix, right_df_suffix).

The rest is as you tried. I renamed the e column that came from c.csv (the third DataFrame) after merging. Hope this helps.
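
For readers without the CSV files, here is a minimal sketch of the same left-merge idea applied to the reproducible input from the question. The names keys and out are only illustrative, and renaming C's e column before the merge (rather than after) is a small variation on the answer above, not the answer's exact code.

```python
import pandas as pd

# Reproducible input copied from the question.
A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})
C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})

keys = ['a', 'b', 'c', 'd']  # illustrative name for the composite key

# Left-join B onto A: every row of A is kept, and the overlapping
# "e" columns are renamed to e_A and e_B via suffixes.
out = A.merge(B, how='left', on=keys, suffixes=('_A', '_B'))

# Rename C's value column first so the second merge has no name clash.
out = out.merge(C.rename(columns={'e': 'e_C'}), how='left', on=keys)

print(out)
```

Because both merges use how='left' with A on the left, the result keeps all five rows of A, with NaN wherever B or C has no matching key.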

edited Apr 15, 2022 at 13:05

answered Apr 15, 2022 at 13:00

exilour


3 Comments

hellomynameisA Over a year ago

Hi, thank you for your answer! I am trying to make it work but it returns only row 2 x z 0 0 0.90 0.01 0.65. Upgrading to the latest version of pandas did not help. I upvoted you anyways

exilour Over a year ago

@hellomynameisA hi! can you check the attached csv files I used for the solution.. Maybe that can help you figure out why its not working for you? gist.github.com/mehedi-shafi/b0e8e78ba103f6b72e78eb32386f003b

hellomynameisA Over a year ago

Great idea! This way it works fine... Maybe that's a memory shortage problem with my (very) big datasets... I'll try to work on the datasets somehow, thank you very much

Shreyansh Gupta · Accepted Answer · 2022-04-15 13:03:38Z

1

Combine the common key columns into a single string column, as shown in the code below.

    A['pKey'] = A.apply(lambda row: row['a'] + "_" + row['b'] + "_" + str(row['c']) + "_" + str(row['d']), axis=1)
    B['pKey'] = B.apply(lambda row: row['a'] + "_" + row['b'] + "_" + str(row['c']) + "_" + str(row['d']), axis=1)
    C['pKey'] = C.apply(lambda row: row['a'] + "_" + row['b'] + "_" + str(row['c']) + "_" + str(row['d']), axis=1)

Then combine the tables using this new column:

merge_ab = A.merge(B, on='pKey', how='left', suffixes=('_A', '_B'))
merge_abc = merge_ab.merge(C, on='pKey', how='left', suffixes=('', '_C'))

Now drop the redundant columns (the duplicated key columns and pKey) and rename the remaining ones as needed to get:

    a   b   c   d   e_A     e_B     e_C
0   x   y   0   1   0.99    0.12    NaN
1   x   y   1   1   0.43    NaN     0.06
2   x   z   0   0   0.90    0.01    0.65
3   y   z   0   1   0.11    0.45    NaN
4   x   z   0   1   0.78    NaN     0.20
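
One possible way to arrive at exactly these columns is sketched below. It is a variation on the approach above, not the answer's original code: the helper add_key is hypothetical, the key is built with vectorized string concatenation instead of apply, and only pKey and e are taken from B and C so that pKey is the only column left to drop. Note that a string key like this assumes the values themselves never contain the "_" separator.

```python
import pandas as pd

# Reproducible input copied from the question.
A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})
C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})


def add_key(df):
    """Return a copy of df with a combined string key column (hypothetical helper)."""
    out = df.copy()
    out['pKey'] = (out['a'] + '_' + out['b'] + '_'
                   + out['c'].astype(str) + '_' + out['d'].astype(str))
    return out


A_k, B_k, C_k = add_key(A), add_key(B), add_key(C)

merge_abc = (
    A_k.rename(columns={'e': 'e_A'})
    # Bring along only the key and the value column from B and C,
    # so no duplicated a/b/c/d columns are created.
    .merge(B_k[['pKey', 'e']].rename(columns={'e': 'e_B'}), on='pKey', how='left')
    .merge(C_k[['pKey', 'e']].rename(columns={'e': 'e_C'}), on='pKey', how='left')
    .drop(columns='pKey')
)

print(merge_abc)
```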

answered Apr 15, 2022 at 13:03

Shreyansh Gupta


2 Comments

hellomynameisA Over a year ago

Hi, thank you for your answer! I am trying to make it work but it returns only row 2 x z 0 0 0.90 0.01 0.65. Upgrading to the latest version of pandas did not help. I upvoted you anyways

Shreyansh Gupta Over a year ago

Hi, thank you for the upvote. Here's the colab notebook link for my answer. Hope this helps. Link: colab.research.google.com/drive/…

mozway · Accepted Answer · 2022-04-15 13:06:16Z

1

You can use a dictionary to hold your dataframes and use functools.reduce:

from functools import reduce

dfs = {'A': A, 'B': B, 'C': C}

out = reduce(lambda a, b: a.merge(b, how='left', on=["a", "b", "c", "d"]),
             [d.rename(columns={'e': f'e_{k}'}) for k, d in dfs.items()])

Or, if you have non-merge columns other than "e":

dfs = {'A': A, 'B': B, 'C': C}

from functools import reduce

out = (reduce(lambda a, b: a.join(b, how='left'),
              [d.set_index(["a", "b", "c", "d"]).add_suffix(f'_{k}') 
               for k,d in dfs.items()])
       .reset_index()
      )

output:

   a  b  c  d   e_A   e_B   e_C
0  x  y  0  1  0.99  0.12   NaN
1  x  y  1  1  0.43   NaN  0.06
2  x  z  0  0  0.90  0.01  0.65
3  y  z  0  1  0.11  0.45   NaN
4  x  z  0  1  0.78   NaN  0.20
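
If a merge on real data returns fewer matches than expected, a quick diagnostic sketch like the following may help. It uses the indicator and validate parameters of DataFrame.merge on the question's A and B; the variable names keys and check are illustrative, and this is only one way to inspect the join, not a diagnosis of any particular dataset.

```python
import pandas as pd

# Reproducible input copied from the question (A and B are enough here).
A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})

keys = ['a', 'b', 'c', 'd']

# indicator=True adds a "_merge" column saying whether each row was found
# in both frames or only in the left one; validate='one_to_one' raises a
# MergeError if the key combination is duplicated on either side.
check = A.merge(B, how='left', on=keys, suffixes=('_A', '_B'),
                indicator=True, validate='one_to_one')
print(check['_merge'].value_counts())

# Key columns with different dtypes (for example int vs. str) are a common
# reason for keys not matching, so compare them side by side.
print(A[keys].dtypes == B[keys].dtypes)
```

With a left join the number of rows in check should equal the number of rows in A; rows marked left_only are the ones that get NaN for the right-hand columns.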

answered Apr 15, 2022 at 13:06

mozway


11 Comments

hellomynameisA Over a year ago

I'm super sorry to write this, but for some reason it does not work on my machine. I copied-pasted your code and just changed the column names to my real column names and it only adds rows which have all three values. I have pandas 1.3.4, do you think it is too old?

mozway Over a year ago

What does "does not work" mean? An error message (which one)? Incorrect output (how)?

hellomynameisA Over a year ago

It returns a dataframe which has only rows which do not have NaN

mozway Over a year ago

can you provide a reproducible example?

hellomynameisA Over a year ago

Example from above: only row 2 x z 0 0 0.90 0.01 0.65
