1

Probably already asked before, buy I cannot find it even after 30 mins of searching.

I have two pandas dataframes with the same columns. The values match except for one column and I want to perform a full outer join, where I get both values if both are there and only one value if one of them is present. There are many matching columns, so I would prefer a solution where I do not have to apply something for each matching column.

Example All columns are the same if the value is in both df, only the frequency varies:

  Gene  GeneID  Frequency
0   AA       1         10
1   BB       2         15
2   CC       3         12
  Gene  GeneID  Frequency
0   AA       1         20
1   DD       4         29

Code:

import pandas as pd


t1 = [{"Gene": "AA", "GeneID": "1" , "Frequency": 10},
{"Gene": "BB", "GeneID": "2" , "Frequency": 15},
{"Gene": "CC", "GeneID": "3" , "Frequency": 12}]


t2 = [{"Gene": "AA", "GeneID": "1" , "Frequency": 20},
{"Gene": "DD", "GeneID": "4" , "Frequency": 29}]

f1 = pd.DataFrame(t1)
f2 = pd.DataFrame(t2)

m = pd.merge(f1,f2,on=['Gene','Gene'],how='outer')

Results in:

  Gene GeneID_x  Frequency_x GeneID_y  Frequency_y
0   AA        1         10.0        1         20.0
1   BB        2         15.0      NaN          NaN
2   CC        3         12.0      NaN          NaN
3   DD      NaN          NaN        4         29.0

Now the ID is either in GeneID_x or or GeneID_y. I would like the following:

  Gene GeneID  Frequency_x   Frequency_y
0   AA        1         10.0         20.0
1   BB        2         15.0          NaN
2   CC        3         12.0          NaN
3   DD        4         NaN          29.0

Of course I can iterate and fill the GeneID where needed, but there are many more columns that match. There has to be a better solution. I also tried concat with group by and aggregate. This works, however I cannot see if the frequency comes from the first or second df if there is only one value.

Thanks.

1 Answer 1

2

You can use:

m = (pd.merge(f1, f2, on='Gene', how='outer')
       .rename(columns={'GeneID_x': 'GeneID'})
       .assign(GeneID=lambda x: x['GeneID'].fillna(x.pop('GeneID_y'))))

Output:

>>> m
  Gene GeneID  Frequency_x  Frequency_y
0   AA      1         10.0         20.0
1   BB      2         15.0          NaN
2   CC      3         12.0          NaN
3   DD      4          NaN         29.0

If (Gene, GeneID) always represent a unique combination, you can use:

m = pd.merge(f1, f2, on=['Gene', 'GeneID'], how='outer')
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.