I know this question might seem repetitive at first, but I could not find a similar question or answer whose solution works for me.
I am working with pandas DataFrames in Python.
Suppose we have three datasets: A, B and C.
- B and C are sub-datasets of A (dataframe A was split in two according to a binary column value)
- The datasets are of different lengths (A is the biggest, B and C are smaller since they are sub-datasets of A)
- Each dataset has 5 columns: a, b, c, d and e (they have the same column names)
- Columns a, b, c and d (taken all together) do not have any repetitions (they could be a composed key of a database)
- Column e is different for each dataset
- Each dataset has a different index from the others (they were NOT generated using pandas.loc, I am dealing with them "already built and taken from outside python")
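The assumptions in the list above can be checked before merging. This is only a minimal sketch: the name `keys` and the two result variables are made up for illustration, and the data is the sample from this question.

```python
import pandas as pd

# Hypothetical name for the composite key described above.
keys = ["a", "b", "c", "d"]

A = pd.DataFrame({"a": ["x", "x", "x", "y", "x"],
                  "b": ["y", "y", "z", "z", "z"],
                  "c": [0, 1, 0, 0, 0],
                  "d": [1, 1, 0, 1, 1],
                  "e": [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({"a": ["x", "x", "y"],
                  "b": ["y", "z", "z"],
                  "c": [0, 0, 0],
                  "d": [1, 0, 1],
                  "e": [0.12, 0.01, 0.45]})

# True if any (a, b, c, d) combination appears more than once in A.
a_has_duplicate_keys = bool(A.duplicated(subset=keys).any())

# Left-merge B's keys against A's keys; indicator=True adds a "_merge"
# column that says whether each key was found in both frames.
check = B[keys].merge(A[keys], on=keys, how="left", indicator=True)
b_keys_all_in_a = bool((check["_merge"] == "both").all())

print(a_has_duplicate_keys, b_keys_all_in_a)
```

If the first value were True, a merge on these columns would multiply rows, so it seems worth checking on the real data.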
My question is: how can I put all these together without losing any row, pairing the rows correctly without using the index?
Here's an example:
* Content of A *:
a    b    c  d  e
"x"  "y"  0  1  0.99  # 0
"x"  "y"  1  1  0.43  # 1
"x"  "z"  0  0  0.90  # 2
"y"  "z"  0  1  0.11  # 3
"x"  "z"  0  1  0.78  # 4
* Content of B *:
a    b    c  d  e
"x"  "y"  0  1  0.12  # 0 of dataframe A
"x"  "z"  0  0  0.01  # 2 of dataframe A
"y"  "z"  0  1  0.45  # 3 of dataframe A
* Content of C *:
a    b    c  d  e
"x"  "y"  1  1  0.06  # 1 of dataframe A
"x"  "z"  0  0  0.65  # 2 of dataframe A
"x"  "z"  0  1  0.20  # 4 of dataframe A
I would like to obtain this output:
* Content of new_df *:
a    b    c  d  e_A   e_B   e_C
"x"  "y"  0  1  0.99  0.12  NaN
"x"  "y"  1  1  0.43  NaN   0.06
"x"  "z"  0  0  0.90  0.01  0.65
"y"  "z"  0  1  0.11  0.45  NaN
"x"  "z"  0  1  0.78  NaN   0.20
My first attempt was the following, but it dropped the rows that did not have all three values (I need NaN there instead).
new_df1 = pd.merge(A, B, how='left', on=["a", "b", "c", "d"])
new_df2 = pd.merge(new_df1, C, how='left', on=["a", "b", "c", "d"])
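To illustrate the row loss I am describing, here is a small sketch comparing the default `how="inner"` with `how="left"` on the sample data. The names `keys`, `inner` and `left` are mine, and whether an inner join is really what happened in my original run is only an assumption.

```python
import pandas as pd

keys = ["a", "b", "c", "d"]  # hypothetical name for the key columns

A = pd.DataFrame({"a": ["x", "x", "x", "y", "x"],
                  "b": ["y", "y", "z", "z", "z"],
                  "c": [0, 1, 0, 0, 0],
                  "d": [1, 1, 0, 1, 1],
                  "e": [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({"a": ["x", "x", "y"],
                  "b": ["y", "z", "z"],
                  "c": [0, 0, 0],
                  "d": [1, 0, 1],
                  "e": [0.12, 0.01, 0.45]})

# Inner join (the default): only keys present in both A and B survive.
inner = pd.merge(A, B, on=keys, suffixes=("_A", "_B"))

# Left join: every row of A is kept; e_B is NaN where B has no match.
left = pd.merge(A, B, on=keys, how="left", suffixes=("_A", "_B"))

print(len(inner), len(left))
print(left)
```

Since the key columns have the same names in both frames, `on=` is used here instead of a `left_on`/`right_on` pair.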
How can I achieve my objective of getting a full dataset made of all columns (5) and all rows (A contains the maximum amount of rows)?
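For reference, this is a sketch of one way the desired table could be built: rename `e` in each frame first, then chain two left merges on the key columns. The column names `e_A`, `e_B`, `e_C` come from the desired output above; `keys` is a name I made up, and I have only tried this on the sample data.

```python
import pandas as pd

keys = ["a", "b", "c", "d"]  # hypothetical name for the key columns

A = pd.DataFrame({"a": ["x", "x", "x", "y", "x"],
                  "b": ["y", "y", "z", "z", "z"],
                  "c": [0, 1, 0, 0, 0],
                  "d": [1, 1, 0, 1, 1],
                  "e": [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({"a": ["x", "x", "y"],
                  "b": ["y", "z", "z"],
                  "c": [0, 0, 0],
                  "d": [1, 0, 1],
                  "e": [0.12, 0.01, 0.45]})
C = pd.DataFrame({"a": ["x", "x", "x"],
                  "b": ["y", "z", "z"],
                  "c": [1, 0, 0],
                  "d": [1, 0, 1],
                  "e": [0.06, 0.65, 0.2]})

# Rename "e" up front so the three value columns never collide,
# then left-merge on the keys so that every row of A is kept.
new_df = (
    A.rename(columns={"e": "e_A"})
     .merge(B.rename(columns={"e": "e_B"}), on=keys, how="left")
     .merge(C.rename(columns={"e": "e_C"}), on=keys, how="left")
)

print(new_df)
```

The index is never used for pairing here; rows are matched only on a, b, c and d, and keys missing from B or C end up as NaN.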
Reproducible input:
import pandas as pd

A = pd.DataFrame({'a': ['x', 'x', 'x', 'y', 'x'],
                  'b': ['y', 'y', 'z', 'z', 'z'],
                  'c': [0, 1, 0, 0, 0],
                  'd': [1, 1, 0, 1, 1],
                  'e': [0.99, 0.43, 0.9, 0.11, 0.78]})
B = pd.DataFrame({'a': ['x', 'x', 'y'],
                  'b': ['y', 'z', 'z'],
                  'c': [0, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.12, 0.01, 0.45]})
C = pd.DataFrame({'a': ['x', 'x', 'x'],
                  'b': ['y', 'z', 'z'],
                  'c': [1, 0, 0],
                  'd': [1, 0, 1],
                  'e': [0.06, 0.65, 0.2]})
(Given in the form dfA = pd.DataFrame(...) so that one can just copy/paste it to have the objects.)