
I'm trying to merge 2 data frames in R, but I have two different columns with different types of ID variable. Sometimes a row will have a value for one of those columns but not the other. I want to consider them both, so that if one frame is missing a value for one of the columns then the other will be used.

> df1 <- data.frame(first = c('a', 'b', NA),  second = c(NA, 'q', 'r'))
> df1
  first second
1     a   <NA>
2     b      q
3  <NA>      r

> df2 <- data.frame(first = c('a', NA, 'c'),  second = c('p', 'q', NA))
> df2
  first second
1     a      p
2  <NA>      q
3     c   <NA>

I want to merge these two data frames and get 2 rows:

  • row 1, because it has the same value for "first"
  • row 2, because it has the same value for "second"
  • row 3 would be dropped, because df1 has a value for "second", but not "first", and df2 has the reverse

It's important that NAs are ignored and don't "match" in this case.

I can get kinda close:

> merge(df1,df2, by='first', incomparables = c(NA))
  first second.x second.y
1     a     <NA>        p
> merge(df1,df2, by='second', incomparables = c(NA))
  second first.x first.y
1      q       b    <NA>

But I can't rbind these two data frames together because they have different column names, and it doesn't seem like the "R" way to do it (in the near future, I'll have a 3rd, 4th and even 5th type of ID).

Is there a less clumsy way to do this?

Edit: Ideally, the output would look like this:

> df3 <- data.frame(first = c('a', 'b'), second = c('p','q'))
> df3
  first second
1     a      p
2     b      q

  • row 1 matched because the column "first" has the same value in both data frames; the value for "second" is filled in from df2
  • row 2 matched because the column "second" has the same value in both data frames; the value for "first" is filled in from df1
  • there is no row 3, because no column has a value in both data frames
  • What is your expected output? Commented Aug 31, 2018 at 13:24
  • Maybe like this: from the result of the first merge, create a copy of column "first" and name the two identical columns "first.x" and "first.y"; do the same for the second merge with column "second"; then bind the two results and eliminate the duplicates. Commented Aug 31, 2018 at 13:35
  • Hi. You give two example outputs, but they have different columns, a contradiction. Please read & act on the guidance for a minimal reproducible example. Part of that is a clear specification. Please give the desired columns & say for a given row (v, ...) under exactly what condition it is in the output, in terms of values v, ..., or for a given row t in terms of values t.v, ... . (Such a clear specification is necessary for you to have asked a clear question but also to code the query, whether by us in answering or by you in the first place.) Commented Sep 1, 2018 at 23:21
  • Appreciate the feedback @philipxy. I intended the two examples to be alternatives, since I wasn't sure if the ideal could be done. The accepted answer produces both. Since it caused confusion to do this I've edited the question to have only one example output. Commented Sep 3, 2018 at 2:14
  • For the future: Note that your text still doesn't say what rows should be in the output as a function of the input. You sort of explain why some example output rows appeared for some example inputs. You seem to maybe be addressing what to do for cases for each row of the cross product of the inputs. But you don't say that that's what you're doing & you don't show you've covered all cases. That is why I suggested that you clearly express a condition in the forms I did. (Often broken into cases via OR.) Eg '(t1.f,t1.s) is in the output if t1 in df1 & t2 in df2 & ...'. Also it maps simply to SQL. Commented Sep 3, 2018 at 2:40

1 Answer


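With sqldf we can write this as a SQL join, because SQL lets us combine join conditions with OR. A comparison against NULL (which is how R's NA is passed to SQL) is never true, so NAs do not match each other: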

library(sqldf)
df <- sqldf("select a.*, b.*
               from df1 a
               join df2 b
                    ON a.first = b.first
                    OR a.second = b.second")


library(dplyr)
# Where first (or second) from df1 is NA, fill it in from df2's copy of the
# column, which the join returned as first..3 (or second..4)
df %>% mutate(first = ifelse(is.na(first), first..3, first),
              second = ifelse(is.na(second), second..4, second)) %>%
       # Drop first..3 and second..4, which are no longer needed
       select(-first..3, -second..4)

  first second
1     a      p
2     b      q
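
Since the question mentions a 3rd, 4th and 5th ID type coming later, here is a minimal base R sketch of the same OR-match that takes the ID columns as a vector. merge_any is a hypothetical helper name, not part of sqldf, dplyr or the code above. It compares every pair of rows, so it assumes both data frames are small enough for a cross join.

```r
# Hypothetical helper: keep row pairs where ANY id column matches,
# treating NA as never matching, then fill NAs in x from y.
merge_any <- function(x, y, ids) {
  # All pairs of row positions (a cross join on indices).
  pairs <- expand.grid(i = seq_len(nrow(x)), j = seq_len(nrow(y)))
  # TRUE where at least one id column is equal and non-NA on both sides.
  hit <- Reduce(`|`, lapply(ids, function(k) {
    a <- as.character(x[[k]])[pairs$i]
    b <- as.character(y[[k]])[pairs$j]
    !is.na(a) & !is.na(b) & a == b
  }))
  pairs <- pairs[hit, , drop = FALSE]
  # For each id column take x's value, falling back to y's when x is NA.
  out <- lapply(ids, function(k) {
    a <- as.character(x[[k]])[pairs$i]
    b <- as.character(y[[k]])[pairs$j]
    ifelse(is.na(a), b, a)
  })
  names(out) <- ids
  as.data.frame(out, stringsAsFactors = FALSE)
}

df1 <- data.frame(first = c("a", "b", NA), second = c(NA, "q", "r"))
df2 <- data.frame(first = c("a", NA, "c"), second = c("p", "q", NA))
res <- merge_any(df1, df2, ids = c("first", "second"))
```

Adding another ID type would then only mean extending the ids vector. The sketch returns just the ID columns; any other columns would need to be carried along separately.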

2 Comments

Thank you. The sqldf solution I understand. The dplyr one I'm having some trouble with. Why "first..3" and "second..4" -- what does this mean?
first..3 and second..4 are generated by the join, since df1 and df2 both have columns with these names. So we just fill any NAs in first and second with the non-NA values from first..3 and second..4, and finally drop first..3 and second..4 using select(-first..3, -second..4).
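
One way to sidestep the auto-generated first..3 and second..4 names is to name the output columns in the SQL itself. This is only a sketch, assuming the same df1 and df2 as in the question and sqldf's default SQLite backend, where coalesce returns its first non-NULL argument:

```r
library(sqldf)

df1 <- data.frame(first = c("a", "b", NA), second = c(NA, "q", "r"))
df2 <- data.frame(first = c("a", NA, "c"), second = c("p", "q", NA))

# coalesce() takes df1's value when it is present, otherwise df2's,
# so no dplyr clean-up step is needed afterwards.
res <- sqldf("select coalesce(a.first,  b.first)  as first,
                     coalesce(a.second, b.second) as second
                from df1 a
                join df2 b
                  on a.first = b.first
                  or a.second = b.second")
```

Each additional ID type would need one more coalesce(...) column and one more OR condition.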
