How to merge data frames in R using alternative columns

Question

I'm trying to merge 2 data frames in R, but I have two different columns with different types of ID variable. Sometimes a row will have a value for one of those columns but not the other. I want to consider them both, so that if one frame is missing a value for one of the columns then the other will be used.

> df1 <- data.frame(first = c('a', 'b', NA),  second = c(NA, 'q', 'r'))
> df1
  first second
1     a   <NA>
2     b      q
3  <NA>      r

> df2 <- data.frame(first = c('a', NA, 'c'),  second = c('p', 'q', NA))
> df2
  first second
1     a      p
2  <NA>      q
3     c   <NA>

I want to merge these two data frames and get 2 rows:

row 1, because it has the same value for "first"
row 2, because it has the same value for "second"
row 3 would be dropped, because df1 has a value for "second", but not "first", and df2 has the reverse

It's important that NAs are ignored and don't "match" in this case.

I can get kinda close:

> merge(df1,df2, by='first', incomparables = c(NA))
  first second.x second.y
1     a     <NA>        p
> merge(df1,df2, by='second', incomparables = c(NA))
  second first.x first.y
1      q       b    <NA>

But I can't rbind these two data frames together because they have different column names, and it doesn't seem like the "R" way to do it (in the near future, I'll have a 3rd, 4th and even 5th type of ID).

Is there a less clumsy way to do this?

Edit: Ideally, the output would look like this:

> df3 <- data.frame(first = c('a', 'b'), second = c('p','q'))
> df3
  first second
1     a      p
2     b      q

row 1 matched because the column "first" has the same value in both data frames; the value for "second" is filled in from df2
row 2 matched because the column "second" has the same value in both data frames; the value for "first" is filled in from df1
there is no row 3, because no column has a matching value in both data frames

Maybe like this: from the result of the first merge, create a copy of column "first" and name the two identical columns "first.x" and "first.y"; then do the same for the second merge with column "second"; then bind the two results and eliminate the duplicates.
– MrSmithGoesToWashington, Commented Aug 31, 2018 at 13:35
Hi. You give two example outputs, but they have different columns, a contradiction. Please read & act on minimal reproducible example. Part of that is a clear specification. Please give the desired columns & say for a given row (v, ...) under exactly what condition it is in the output, in terms of values v, ..., or for a given row t in terms of values t.v, ... . (Such a clear specification is necessary for you to have asked a clear question but also to code the query, whether by us in answering or by you in the first place.)
– philipxy, Commented Sep 1, 2018 at 23:21
Appreciate the feedback @philipxy. I intended the two examples to be alternatives, since I wasn't sure if the ideal could be done. The accepted answer produces both. Since it caused confusion to do this I've edited the question to have only one example output.
– Hissohathair, Commented Sep 3, 2018 at 2:14
For the future: Note that your text still doesn't say what rows should be in the output as a function of the input. You sort of explain why some example output rows appeared for some example inputs. You seem to maybe be addressing what to do for cases for each row of the cross product of the inputs. But you don't say that that's what you're doing & you don't show you've covered all cases. That is why I suggested that you clearly express a condition in the forms I did. (Often broken into cases via OR.) Eg '(t1.f,t1.s) is in the output if t1 in df1 & t2 in df2 & ...'. Also it maps simply to SQL.
– philipxy, Commented Sep 3, 2018 at 2:40

A. Suliman · Accepted Answer · 2018-09-01 07:42:12Z

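With sqldf we can express this directly, because SQL lets a join condition combine alternatives with OR. A comparison with NULL (R's NA) is never true in SQL, so NAs do not match each other: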

library(sqldf)
df <- sqldf("select a.*, b.*
               from df1 a
               join df2 b
                    ON a.first = b.first
                    OR a.second = b.second")


library(dplyr)
# The join returns df2's columns under the generated names first..3 and second..4.
# Where first (or second) is NA, use the value from first..3 (or second..4) instead.
df %>% mutate(first = ifelse(is.na(first), first..3, first),
              second = ifelse(is.na(second), second..4, second)) %>%
  # Drop first..3 and second..4 now that they are no longer needed
  select(-first..3, -second..4)

  first second
1     a      p
2     b      q
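
Since the question expects a 3rd, 4th and 5th ID column later, here is a base R sketch of the same "match on any key, ignore NA" rule for an arbitrary list of key columns. The helper name merge_any is hypothetical, and the sketch assumes both data frames hold the key columns as character vectors (hence stringsAsFactors = FALSE). It compares every row of one frame with every row of the other, so it only suits small inputs.

```r
df1 <- data.frame(first = c("a", "b", NA), second = c(NA, "q", "r"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(first = c("a", NA, "c"), second = c("p", "q", NA),
                  stringsAsFactors = FALSE)

# Hypothetical helper: pair up rows of x and y that agree on at least one
# of the columns in `keys`; an NA on either side never counts as a match.
merge_any <- function(x, y, keys) {
  # Every combination of a row index from x (i) and a row index from y (j)
  pairs <- expand.grid(i = seq_len(nrow(x)), j = seq_len(nrow(y)))

  # TRUE where some key is equal and non-NA on both sides
  hit <- Reduce(`|`, lapply(keys, function(k) {
    eq <- x[[k]][pairs$i] == y[[k]][pairs$j]
    !is.na(eq) & eq
  }))
  pairs <- pairs[hit, ]

  # For each key, keep x's value and fall back to y's where x has NA
  out <- lapply(keys, function(k) {
    xv <- x[[k]][pairs$i]
    yv <- y[[k]][pairs$j]
    ifelse(is.na(xv), yv, xv)
  })
  names(out) <- keys

  # Drop rows produced more than once (e.g. pairs matching on several keys)
  unique(as.data.frame(out, stringsAsFactors = FALSE))
}

res <- merge_any(df1, df2, c("first", "second"))
print(res)
```

On the example data this returns the two rows (a, p) and (b, q). Adding another ID type only means adding its column name to the keys vector.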

edited Sep 1, 2018 at 7:42

answered Aug 31, 2018 at 14:04


2 Comments

Hissohathair Over a year ago

Thank you. The sqldf solution I understand. The dplyr one I'm having some trouble with. Why "first..3" and "second..4" -- what does this mean?

A. Suliman Over a year ago

first..3 and second..4 are generated by the join, because df1 and df2 both have columns named first and second. We simply fill the NAs in first and second, if any, with the non-NA values from first..3 and second..4, and finally drop first..3 and second..4 with select(-first..3, -second..4).
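
The generated names can be avoided altogether by naming the output columns in the query itself. The sketch below is a variant of the accepted answer, not the answerer's own code: it assumes SQLite's COALESCE function (which returns its first non-NULL argument) is acceptable here, and it does the NA filling inside SQL so that no dplyr step and no names like first..3 are needed.

```r
library(sqldf)

df1 <- data.frame(first = c("a", "b", NA), second = c(NA, "q", "r"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(first = c("a", NA, "c"), second = c("p", "q", NA),
                  stringsAsFactors = FALSE)

# Each output column takes df1's value and falls back to df2's value
# when df1 has NA (NULL in SQL). The AS aliases fix the column names,
# so nothing depends on how duplicate names would be repaired.
res <- sqldf("
  SELECT COALESCE(a.first,  b.first)  AS first,
         COALESCE(a.second, b.second) AS second
  FROM df1 a
  JOIN df2 b
    ON a.first = b.first
    OR a.second = b.second
")
print(res)
```

On the example data this should give the same two rows as the output shown in the answer. Row order is not guaranteed without an ORDER BY clause.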
