13

I have 2 dataframes:

dat: 1900 obs of 9 variables

              V1        V2        V3        V4 V5      V6       V7     V8             V9
1        V_P50P50_Q3 chr12 106642383 106642395  + 18.1425 4.03e-08 0.0515  GGGGGACTCCCCC
2 V_P50RELAP65_Q5_01  chr8 142276666 142276677  - 16.6429 2.51e-07 0.2780   GGGATTTCCCAC
3          V_RELA_Q6 chr22  51020067  51020078  - 15.9395 2.71e-07 0.3350   GGGAATTTCCCC
4       V_NFKB_Q6_01 chr14  98601454  98601469  + 17.0684 3.08e-07 0.236  GGAGTGGAAATTCC
5          V_CREL_Q6 chr22  51020068  51020079  - 16.1165 3.19e-07 0.4050   AGGGAATTTCCC

dat.markov: 1486 obs of 9 variables

            V1    V2        V3        V4 V5      V6       V7    V8               V9
1 V_NFKB_Q6_01 chr14  98601454  98601469  + 17.2212 1.33e-07 0.146 GGAGTGGAAATTCCCT
2  V_P50P50_Q3 chr12 106642383 106642395  + 16.9358 1.57e-07 0.201    GGGGGACTCCCCC
3    V_CREL_Q6 chr22  51020068  51020079  - 16.0549 2.29e-07 0.292     AGGGAATTTCCC
4 V_NFKB_Q6_01 chr22  51020064  51020079  + 16.9906 2.32e-07 0.146 TTGGGGGAAATTCCCT
5    V_RELA_Q6 chr22  51020067  51020078  - 15.7496 3.42e-07 0.433     GGGAATTTCCCC

I need to merge the two data frames such that I get all the rows with matching columns V1, V2, V3 and V4 between the two data.frames.

I tried:

y<-merge(dat,dat.markov,by=c("V1","V2","V3","V4")) 

which gives me a merged dataframe but with 1513 obs. But technically, the number of observations should be equal to or less than the smaller dataframe i.e. 1486 obs.

My merged data.frame looks alright in terms of number of columns returned:

         V1    V2        V3        V4 V5.x    V6.x     V7.x  V8.x       V9.x V5.y    
1 V_CREL_01 chr10 112778464 112778473    + 12.9434 1.94e-05 0.694 TGGGTTTTCC    +  
     V6.y     V7.y  V8.y       V9.y
1 12.8838 2.35e-05 0.788 TGGGTTTTCC

I know you can intersect the data.frames using one column but is there a way in which you can intersect two data.frames on multiple columns?

2
  • 1
    You're doing this correctly, however, if you have rows where the combination of those three columns is not unique, you can have a very large result! e.g. merge(data.frame(X=rep(1:4, 2), Y1=LETTERS[1:4]), data.frame(X=1:3, Y2=letters[1:3])) where the result is longer than the second input. Commented Sep 9, 2013 at 19:33
  • I think you are right, I just checked and there are quite a few duplicates in my merged dataframe. So depending on the data, my code would work. Thanks! Commented Sep 9, 2013 at 19:51

1 Answer 1

3

If I understand, you want columns 1,2,3,4 to be perfectly matched in both frames in order to keep them? Besides merge, I would use the interaction and match functions

dat$combine = as.character(interaction(dat$V1, dat$V2, dat$V3, dat$V4))
dat.markov$combine = as.character(interaction(dat.markov$V1, dat.markov$V2, dat.markov$V3, dat.markov$V4))

dat.overlap = dat[complete.cases(match(dat.markov$combine, dat$combine)),]
Sign up to request clarification or add additional context in comments.

3 Comments

This doesn't bring the additional columns contained in dat.markov into the resulting dat.overlap like a merge would.
true. you could combine that last line with a similar call to dat.markov and flip the match arguements, then specify the columns you'd like. really though, merge then unique on an ID-type variable is the way to go.
@dylanjf: I am really having memory problems with the interaction function. R just hangs when I try to use it, so I will just have to merge my data. I tried unique previously but I dont think I can use it because even though the V1, V2, V3 and V4 columns are same, the corresponding columns are non-unique.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.