Merging columns with overlapping data in R data frames

Question

a<-data.frame(cbind("Sample"=c("100","101","102","103"),"Status"=c("Y","","","partial")))
b<-data.frame(cbind("Sample"=c("100","101","102","103","106"),"Status"=c("NA","Y","","","Y")))

desired<-data.frame(cbind("Sample"=c("100","101","102","103","106"),"Status"=c("Y","Y","","partial","Y")))

I have sample processing data in multiple sources and I'd like to combine them into a master list. How can I merge the "Status" column between 2 data frames such that a overrules b in order to collate "Y" and "partial" for each sample? Thank you in advance.

Both variables of a and of b are factors. Working with factors like this is a pain in the neck. You should consider converting these to character and numeric, which are easier to work with.
– lmo, Commented Jun 7, 2017 at 15:50
Just use data.frame without the cbind, or you're making a matrix before converting it to a data.frame, which will sooner or later screw up types. Also, using NA instead of "" will make your life easier.
– alistaire, Commented Jun 7, 2017 at 15:52
Alistaire, you're right, my example is a bit sloppy with the cbind. The example is an over-simplification, as there are ~10 non-""/NA strings that can exist (not just partial/Y). This makes Mudskipper's solution a bit trickier. I'm not familiar with Simone's ":=" syntax, and it doesn't appear to run.
– sm002, Commented Jun 7, 2017 at 19:56
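
The suggestions in the comments above can be sketched as follows (this is not code from the thread): build the frames with data.frame directly, keep the columns as character, and use NA for missing statuses. Treating the literal "NA" in b as a missing value is an assumption here. With real NAs, "a overrules b" becomes a single ifelse that does not care how many distinct status strings exist.

```r
# Build the frames directly; stringsAsFactors = FALSE keeps columns as character
# (it is the default from R 4.0.0 onwards, but is spelled out for older versions).
a <- data.frame(Sample = c("100", "101", "102", "103"),
                Status = c("Y", NA, NA, "partial"),
                stringsAsFactors = FALSE)
b <- data.frame(Sample = c("100", "101", "102", "103", "106"),
                Status = c(NA, "Y", NA, NA, "Y"),
                stringsAsFactors = FALSE)

# Full outer join; the two Status columns become Status.a and Status.b
m <- merge(a, b, by = "Sample", all = TRUE, suffixes = c(".a", ".b"))

# Take a's value where it is present, otherwise fall back to b's
m$Status <- ifelse(is.na(m$Status.a), m$Status.b, m$Status.a)
combined <- m[, c("Sample", "Status")]
```

Sample 102 comes out as NA rather than ""; if the empty strings of the desired output are needed, the NAs can be replaced at the very end.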

simone · Accepted Answer · 2017-06-10 10:13:17Z

1

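library(data.table)

a <- data.table(Sample = c("100", "101", "102", "103"), Status = c("Y", "", "", "partial"))
b <- data.table(Sample = c("100", "101", "102", "103", "106"), Status = c("NA", "Y", "", "", "Y"))

# Full outer join: Status from a becomes Status.x, Status from b becomes Status.y
merged <- merge(a, b, by = "Sample", all = TRUE)
# Prefer a's value; fall back to b's when a's is missing (NA) or empty ("")
merged[, Status := ifelse(!is.na(Status.x) & Status.x != "", Status.x, Status.y)]
# Drop the two intermediate columns
merged[, `:=`(Status.x = NULL, Status.y = NULL)]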

edited Jun 10, 2017 at 10:13

answered Jun 7, 2017 at 16:00

simone


2 Comments

sm002 Over a year ago

Hi Simone, I like that this approach is more generalized, but := doesn't seem to work. Where is the syntax error?

simone Over a year ago

@sm002 I updated the answer. You need to load data.table.
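
For readers unfamiliar with `:=`: it is data.table's operator for adding, updating, or deleting columns by reference, and it only works inside the `[...]` of a data.table once the package is loaded. A minimal illustration, using a made-up table `dt` that is not part of the question:

```r
library(data.table)

# Hypothetical toy table, only to illustrate the operator
dt <- data.table(x = c(1, 2, 3), y = c("a", "b", "c"))

# Add a column z by reference (no copy, no reassignment needed)
dt[, z := x * 2]

# Functional form: delete several columns in one call
dt[, `:=`(x = NULL, y = NULL)]

names(dt)
```

Without data.table loaded, or on a plain data.frame, the same lines fail, which is the likely cause of the error reported above.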

moodymudskipper · Accepted Answer · 2017-06-07 16:04:38Z

1

I assume you want to keep the values from a and b according to an order of priority: Y overrides partial, which overrides NA, which overrides the empty string.

d <- merge(a, b, by = "Sample", all = TRUE)
d$Status <- ""
# Each apply() call tests every row of d; later assignments override earlier ones
d$Status[apply(d, 1, function(x) any(is.na(x)))] <- "" # cleaning the NAs I introduced with the merge
d$Status[apply(d, 1, `%in%`, x = "NA")] <- NA # or "NA" if you want to keep it this way, or "" if you want to get rid of them
d$Status[apply(d, 1, `%in%`, x = "partial")] <- "partial"
d$Status[apply(d, 1, `%in%`, x = "Y")] <- "Y"
d <- d[, c(1, 4)] # keep Sample and the combined Status

#   Sample  Status
# 1    100       Y
# 2    101       Y
# 3    102        
# 4    103 partial
# 5    106       Y
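
If there are more than a handful of status strings, the same priority idea can be written once with a lookup vector instead of one line per status. This is a sketch rather than part of the original answer; the `priority` vector and its ordering are assumptions to adapt to the real data.

```r
a <- data.frame(Sample = c("100", "101", "102", "103"),
                Status = c("Y", "", "", "partial"),
                stringsAsFactors = FALSE)
b <- data.frame(Sample = c("100", "101", "102", "103", "106"),
                Status = c("NA", "Y", "", "", "Y"),
                stringsAsFactors = FALSE)

# Assumed ranking, highest priority first; extend with the other status strings
priority <- c("Y", "partial", "NA", "")

d <- merge(a, b, by = "Sample", all = TRUE)

# Rank each status. Real NAs from the merge and strings not listed in
# `priority` get the lowest rank, so they never win and come out as "".
rank_x <- match(d$Status.x, priority, nomatch = length(priority))
rank_y <- match(d$Status.y, priority, nomatch = length(priority))

# Keep whichever status has the better (smaller) rank in each row
d$Status <- priority[pmin(rank_x, rank_y)]
d <- d[, c("Sample", "Status")]
```

Note that this ranks by status value, as the answer assumes, rather than always letting a overrule b; the two rules agree on the example data.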

edited Jun 7, 2017 at 16:04

answered Jun 7, 2017 at 15:59

moodymudskipper


1 Comment

moodymudskipper Over a year ago

My merge adds some NAs, though (real NAs, not "NA"), so if you have real NAs in your data set and want to keep them for some reason, you'll have to replace them with something else in a and b (like "NA", or Inf, or whatever).
