1

I know this question might look like a duplicate at first, but I could not find another question or answer whose solution works for me.

I am working with pandas DataFrames in Python.

Suppose we have 3 datasets: A, B and C.

  1. B and C are sub-datasets of A (dataframe A was split in two according to a binary column value)
  2. The datasets are of different lengths (A is the biggest, B and C are smaller since they are sub-datasets of A)
  3. Each dataset has 5 columns: a, b, c, d and e (they have the same column names)
  4. Columns a, b, c and d (taken all together) do not have any repetitions (they could be a composite key of a database)
  5. Column e is different for each dataset
  6. Each dataset has a different index from the others (they were NOT generated using pandas.loc, I am dealing with them "already built and taken from outside python")

My question is: how can I put all of these together without losing any rows, pairing them correctly without using the index?

Here's an example:

* Content of A *:
a   b   c   d   e
"x" "y" 0   1   0.99        # 0
"x" "y" 1   1   0.43        # 1
"x" "z" 0   0   0.90        # 2
"y" "z" 0   1   0.11        # 3
"x" "z" 0   1   0.78        # 4

* Content of B *:
a   b   c   d   e
"x" "y" 0   1   0.12        # 0 of dataframe A
"x" "z" 0   0   0.01        # 2 of dataframe A
"y" "z" 0   1   0.45        # 3 of dataframe A

* Content of C *:
a   b   c   d   e
"x" "y" 1   1   0.06        # 1 of dataframe A
"x" "z" 0   0   0.65        # 2 of dataframe A
"x" "z" 0   1   0.20        # 4 of dataframe A

I would like to obtain this output:

* Content of new_df *:
a   b   c   d   e_A   e_B   e_C
"x" "y" 0   1   0.99  0.12  NaN
"x" "y" 1   1   0.43  NaN   0.06
"x" "z" 0   0   0.90  0.01  0.65
"y" "z" 0   1   0.11  0.45  NaN
"x" "z" 0   1   0.78  NaN   0.20

My first attempt was the following code, but it dropped the rows that did not have all three values (instead, I need NaN to be inserted).

new_df1 = pd.merge(A, B, how='left', on=["a", "b", "c", "d"])
new_df2 = pd.merge(new_df1, C, how='left', on=["a", "b", "c", "d"])

How can I achieve my objective of getting a full dataset made of all columns (5) and all rows (A contains the maximum amount of rows)?

reproducible input:

import pandas as pd

A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})

B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})

C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})
6
  • Can you provide your dataframes as DataFrame constructors? Commented Apr 15, 2022 at 12:53
  • I am sorry, but I don't understand what you mean @mozway. Can you please explain what I should add to my question? Commented Apr 15, 2022 at 12:55
  • code like dfA = pd.DataFrame(...) so that one can just copy/paste it to have the objects Commented Apr 15, 2022 at 12:56
  • I created them here while writing the question, let me try to copy them on python... Commented Apr 15, 2022 at 12:57
  • Works fine for me (except the suffixes) Commented Apr 15, 2022 at 13:01

3 Answers

2

This seems to work for me.

import pandas as pd

a = pd.read_csv('a.csv')
b = pd.read_csv('b.csv')
c = pd.read_csv('c.csv')

e = a.merge(b, how='left', left_on=['a', 'b', 'c', 'd'], right_on=['a','b','c', 'd'], suffixes=['_A', '_B'])
e = e.merge(c, how='left', left_on=['a', 'b', 'c', 'd'], right_on=['a','b','c', 'd'])

e = e.rename(columns={'e': 'e_C'})

print(e.head())

A bit of explanation: pd.DataFrame.merge() lets you state the joining columns explicitly (see the docs), so we can use that to merge on a, b, c and d.

The suffixes parameter renames the overlapping columns by appending the given suffix to each side (left_df_suffix, right_df_suffix).

The rest is as you have tried. I renamed the e column of c.csv (or DF) after merging. Hope this helps.
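For readers without the CSV files, here is a minimal sketch of the same idea using the constructor data from the question. It assumes the key columns are a, b, c and d, and it renames C's e column before merging rather than after, which is a small variation on the code above.

```python
import pandas as pd

A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})
C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})

keys = ['a', 'b', 'c', 'd']

# A left merge keeps every row of A; rows without a match in B get NaN in e_B.
merged = A.merge(B, how='left', on=keys, suffixes=('_A', '_B'))

# Renaming C's value column up front means no suffix handling is needed here.
merged = merged.merge(C.rename(columns={'e': 'e_C'}), how='left', on=keys)

print(merged)
```

Because both merges use how='left' with A on the left, the result should have as many rows as A as long as the key columns are unique in B and C.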


3 Comments

Hi, thank you for your answer! I am trying to make it work but it returns only row 2 x z 0 0 0.90 0.01 0.65. Upgrading to the latest version of pandas did not help. I upvoted you anyways
@hellomynameisA hi! Can you check the attached csv files I used for the solution? Maybe that can help you figure out why it's not working for you. gist.github.com/mehedi-shafi/b0e8e78ba103f6b72e78eb32386f003b
Great idea! This way it works fine... Maybe that's a memory shortage problem with my (very) big datasets... I'll try to work on the datasets somehow, thank you very much
1

Combine the common key columns into a single string column, as shown in the code below.

    A['pKey'] = A.apply(lambda row: row['a'] + "_" + row['b'] + "_" + str(row['c']) + "_" + str(row['d']), axis=1)
    B['pKey'] = B.apply(lambda row: row['a'] + "_" + row['b'] + "_" + str(row['c']) + "_" + str(row['d']), axis=1)
    C['pKey'] = C.apply(lambda row: row['a'] + "_" + row['b'] + "_" + str(row['c']) + "_" + str(row['d']), axis=1)

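Then combine the tables using this new column. Only pKey and e are taken from B and C, so the key columns a, b, c and d are not duplicated:

merge_ab = A.merge(B[['pKey', 'e']], on='pKey', how='left', suffixes=('_A', '_B'))
merge_abc = merge_ab.merge(C[['pKey', 'e']].rename(columns={'e': 'e_C'}), on='pKey', how='left')

Now drop the helper column:

merge_abc = merge_abc.drop(columns='pKey')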

    a   b   c   d   e_A     e_B     e_C
0   x   y   0   1   0.99    0.12    NaN
1   x   y   1   1   0.43    NaN     0.06
2   x   z   0   0   0.90    0.01    0.65
3   y   z   0   1   0.11    0.45    NaN
4   x   z   0   1   0.78    NaN     0.20
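As a side note, the string key can also be built without a row-wise apply. The sketch below is one possible way to do it; add_pkey is a hypothetical helper name, not part of pandas, and it assumes the separator does not occur inside the key values.

```python
import pandas as pd

def add_pkey(df, cols, sep="_"):
    """Return a copy of df with a string 'pKey' column built from cols.

    Hypothetical helper for illustration; assumes sep never appears in the values.
    """
    out = df.copy()
    as_text = [df[c].astype(str) for c in cols]
    # Series.str.cat joins the first column with the remaining ones element-wise.
    out["pKey"] = as_text[0].str.cat(as_text[1:], sep=sep)
    return out

A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})

keyed = add_pkey(A, ['a', 'b', 'c', 'd'])
print(keyed['pKey'].tolist())
```

If a key value can contain the separator, two different key tuples could collapse into the same string, so merging directly on the list of columns is the safer choice in that case.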

2 Comments

Hi, thank you for your answer! I am trying to make it work but it returns only row 2 x z 0 0 0.90 0.01 0.65. Upgrading to the latest version of pandas did not help. I upvoted you anyways
Hi, thank you for the upvote. Here's the colab notebook link for my answer. Hope this helps. Link: colab.research.google.com/drive/…
1

You can use a dictionary to hold your dataframes and use functools.reduce:

dfs = {'A': A, 'B': B, 'C': C}

from functools import reduce

out = reduce(lambda a, b: a.merge(b, how='left', on=["a", "b", "c", "d"]),
             [d.rename(columns={'e': f'e_{k}'}) for k, d in dfs.items()])

Or, if you have non-merge columns other than "e":

dfs = {'A': A, 'B': B, 'C': C}

from functools import reduce

out = (reduce(lambda a, b: a.join(b, how='left'),
              [d.set_index(["a", "b", "c", "d"]).add_suffix(f'_{k}') 
               for k,d in dfs.items()])
       .reset_index()
      )

output:

   a  b  c  d   e_A   e_B   e_C
0  x  y  0  1  0.99  0.12   NaN
1  x  y  1  1  0.43   NaN  0.06
2  x  z  0  0  0.90  0.01  0.65
3  y  z  0  1  0.11  0.45   NaN
4  x  z  0  1  0.78   NaN  0.20

11 Comments

I'm super sorry to write this, but for some reason it does not work on my machine. I copy-pasted your code and just changed the column names to my real column names, and it only returns rows which have all three values. I have pandas 1.3.4, do you think it is too old?
What does "does not work" mean? An error message (which one)? Incorrect output (how)?
It returns a dataframe which has only rows which do not have NaN
can you provide a reproducible example?
Example from above: only row 2 x z 0 0 0.90 0.01 0.65
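The thread does not establish why the real data behaves differently. One way to investigate is to use the indicator parameter of merge, which marks rows of A that found no partner in B. The sketch below uses made-up data with a trailing space in one key value, purely as an assumed example of a key mismatch; it is not a diagnosis of the asker's dataset.

```python
import pandas as pd

keys = ['a', 'b', 'c', 'd']

A = pd.DataFrame({'a': ['x', 'x'], 'b': ['y', 'z'],
                  'c': [0, 0], 'd': [1, 0], 'e': [0.99, 0.90]})

# Hypothetical "dirty" B: the first key value has a trailing space ('x ').
B = pd.DataFrame({'a': ['x ', 'x'], 'b': ['y', 'z'],
                  'c': [0, 0], 'd': [1, 0], 'e': [0.12, 0.01]})

# indicator=True adds a _merge column: 'both' for matched rows, 'left_only' otherwise.
check = A.merge(B, how='left', on=keys, suffixes=('_A', '_B'), indicator=True)
print(check)

# A left merge on unique keys should never return fewer rows than A.
print(len(check) == len(A))

# Comparing key dtypes side by side can reveal int vs. string mismatches.
print(pd.concat([A[keys].dtypes, B[keys].dtypes], axis=1, keys=['A', 'B']))
```

If rows show up as left_only even though they look identical, stripping whitespace or aligning dtypes on the key columns before merging is a reasonable next thing to try. If the row count shrinks, something after the merge (for example an inner join or a dropna call) is probably removing the rows.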
