Pandas merge dataframe with the same columns and one one varying

Question

Probably already asked before, buy I cannot find it even after 30 mins of searching.

I have two pandas dataframes with the same columns. The values match except for one column and I want to perform a full outer join, where I get both values if both are there and only one value if one of them is present. There are many matching columns, so I would prefer a solution where I do not have to apply something for each matching column.

Example All columns are the same if the value is in both df, only the frequency varies:

  Gene  GeneID  Frequency
0   AA       1         10
1   BB       2         15
2   CC       3         12
  Gene  GeneID  Frequency
0   AA       1         20
1   DD       4         29

Code:

import pandas as pd


t1 = [{"Gene": "AA", "GeneID": "1" , "Frequency": 10},
{"Gene": "BB", "GeneID": "2" , "Frequency": 15},
{"Gene": "CC", "GeneID": "3" , "Frequency": 12}]


t2 = [{"Gene": "AA", "GeneID": "1" , "Frequency": 20},
{"Gene": "DD", "GeneID": "4" , "Frequency": 29}]

f1 = pd.DataFrame(t1)
f2 = pd.DataFrame(t2)

m = pd.merge(f1,f2,on=['Gene','Gene'],how='outer')

Results in:

  Gene GeneID_x  Frequency_x GeneID_y  Frequency_y
0   AA        1         10.0        1         20.0
1   BB        2         15.0      NaN          NaN
2   CC        3         12.0      NaN          NaN
3   DD      NaN          NaN        4         29.0

Now the ID is either in GeneID_x or or GeneID_y. I would like the following:

  Gene GeneID  Frequency_x   Frequency_y
0   AA        1         10.0         20.0
1   BB        2         15.0          NaN
2   CC        3         12.0          NaN
3   DD        4         NaN          29.0

Of course I can iterate and fill the GeneID where needed, but there are many more columns that match. There has to be a better solution. I also tried concat with group by and aggregate. This works, however I cannot see if the frequency comes from the first or second df if there is only one value.

Thanks.

Corralien · Accepted Answer · 2024-01-17 10:39:09Z

2

You can use:

m = (pd.merge(f1, f2, on='Gene', how='outer')
       .rename(columns={'GeneID_x': 'GeneID'})
       .assign(GeneID=lambda x: x['GeneID'].fillna(x.pop('GeneID_y'))))

Output:

>>> m
  Gene GeneID  Frequency_x  Frequency_y
0   AA      1         10.0         20.0
1   BB      2         15.0          NaN
2   CC      3         12.0          NaN
3   DD      4          NaN         29.0

If (Gene, GeneID) always represent a unique combination, you can use:

m = pd.merge(f1, f2, on=['Gene', 'GeneID'], how='outer')

answered Jan 17, 2024 at 10:39

Corralien

121k8 gold badges43 silver badges68 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Pandas merge dataframe with the same columns and one one varying

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related