Combine set of conditions in data.table to extract value using binary search

Question

After the poor execution and explanation of my previous question, I'll start over and try to formulate the question as briefly and generally as possible.

I have two dataframes (see the examples below). Each dataset contains the same number of columns.

tc <- textConnection('
ID  Track1  Track2  Track3  Track4  Time    Loc
4   15      ""      ""      50      40      1   
5   17      115     109     55      50      1   
6   17      115     109     55      60      1   
7   13      195     150     60      70      1
8   13      195     150     60      80      1
9   ""      ""      181     70      90      2 #From this row, example data added
10  ""      ""      182     70      92      2
11  429     31      ""      80      95      3
12  480     31      12      80      96      3 
13  118     ""      ""      90      100     4
14  120     16      213     90      101     4   
')

MATCHINGS <- read.table(tc, header=TRUE)

tc <- textConnection('
ID  Track1  Track2  Track3  Track4  Time    Loc
""  15      ""      ""      50      40      1   
""  17      ""     109      55      50      1
""  17      432    109      55      65      1   
""  17      115     109     55      59      1       
""  13      195     150     60      68      1
""  13      195     150     60      62      1
""  10      5       1       10      61      3
""  13      195     150     60      72      1
""  40      ""      181     70      82      2 #From this row, example data added
""  ""      ""      182     70      85      2
""  429     ""      ""      80      90      3
""  ""      31      12      80      92      3
""  ""      ""      ""      90      95      4
""  118     16      213     90      96      4
')

INVOLVED <- read.table(tc, header=TRUE)

The goal is to place the least recent ID from MATCHINGS (that is, the earliest qualifying entry) into INVOLVED by matching on Track1 to Track4 and Loc. An extra condition is that the Time of the INVOLVED entry may not be higher than the Time of the matching entry in MATCHINGS. Furthermore, a match on Track1 is most preferred and a match on Track4 is least preferred. However, only Track4 is always available (all other Track columns can be empty). Thus the expected results are:

ID Track1 Track2 Track3 Track4 Time Loc
4     15     ""     ""     50   40   1
5     17     ""    109     55   50   1
""    17    432    109     55   65   1
6     17    115    109     55   59   1
7     13    195    150     60   68   1
7     13    195    150     60   62   1
""    10      5      1     10   61   3
8     13    195    150     60   72   1
9     40     ""    181     70   82   2 #From this row, example data added
10    ""     ""    182     70   85   2
11    429    ""     ""     80   90   3
12    ""     31     12     80   92   3
13    ""     ""     ""     90   95   4 
13    118    16    213     90   96   4

I tried to do this with the data.table package, but failed to do it efficiently. Is it possible to get rid of the vector scans and go through the data efficiently without looping?

library(data.table)
dat <- data.table(MATCHINGS)
for(i in 1:nrow(INVOLVED)){
    row <- INVOLVED[i,]
    match <- dat[Time>=row$Time][Loc==row$Loc][Track4==row$Track4][Track4!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
    match <- dat[Time>=row$Time][Loc==row$Loc][Track3==row$Track3][Track3!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
    match <- dat[Time>=row$Time][Loc==row$Loc][Track2==row$Track2][Track2!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
    match <- dat[Time>=row$Time][Loc==row$Loc][Track1==row$Track1][Track1!=""][order(Time)][1]
    if(!is.na(match$ID)){ INVOLVED$ID[i]<-match$ID }
}

Update

Updated the example data to show the need for Track1 to 3. As shown, Track1 is most important and Track4 least important. Even if Track2 to Track4 match MATCHINGS entry x and only Track1 matches MATCHINGS entry y, the ID of y should be assigned to that INVOLVED row. So: a Track3 match overrides a Track4 match, a Track2 match overrides a Track3 match, and a Track1 match overrides a Track2 match.

Is the data sorted somehow, or could it come in any order with regard to the matching columns? Based on the example data, it seems that you should be able to create a surrogate key that would summarize Track1-Track4 and Loc in a single unique value. Is this the case?
– Valentin Ruano, Commented Oct 4, 2012 at 10:25
You mean give combinations of Track1 to 4 and Loc an ID based on their values, so you could match on these new surrogate IDs? Since multiple forms of tracking can be missing, I don't think you would be able to match using a surrogate key. For instance, if the entry from MATCHINGS contains 3 types of tracking and the corresponding entry from INVOLVED contains only two, you would get different keys, even though they should match.
– Max van der Heijden, Commented Oct 4, 2012 at 10:35
I confess that I have difficulty understanding your code... So this means that as long as Loc matches and ANY of the values in Track1-Track4 matches, then there is a match, right? And, on top of that, you give preference to a matched ID coming from Track1 over one from Track4, based on the order of their match statements in the code, right?
– Valentin Ruano, Commented Oct 4, 2012 at 10:47
Yes, that is correct. Track1 is most preferred, Track4 least preferred. Track4 is always there and all other fields can be missing. I also put that in the question.
– Max van der Heijden, Commented Oct 4, 2012 at 11:02
You are not thinking in a data.table way. Reread the vignettes and examples, and think about how this can be done with merging, by, etc. At the moment your code is not using any of the data.table efficiency!
– mnel, Commented Oct 4, 2012 at 11:20

Arun · Accepted Answer · 2015-10-11 13:40:27Z

Now that the roll argument can also roll the next observation backward, and with the new (v1.9.6+) on= argument, we can do this much more straightforwardly:

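library(data.table)
setDT(MATCHINGS)
setDT(INVOLVED)
# Rolling join: for each INVOLVED row, take the first MATCHINGS row with the
# same Loc and Track4 whose Time is equal or later (roll=-Inf)
INVOLVED[ , ID := MATCHINGS[INVOLVED, ID, roll=-Inf,
                    mult="first", on=c("Loc", "Track4", "Time")]]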

That's it.
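To make the behaviour of roll=-Inf concrete, here is a minimal sketch on made-up toy tables. The names events, lookups and grp are hypothetical and are not part of the question's data. It only illustrates how a lookup row picks the next later row within its group, or NA when there is none.

```r
library(data.table)

# Hypothetical toy tables (not the question's data)
events  <- data.table(grp  = c(1L, 1L, 2L),
                      Time = c(10L, 20L, 15L),
                      ID   = c(101L, 102L, 201L))
lookups <- data.table(grp  = c(1L, 1L, 1L, 2L),
                      Time = c(5L, 10L, 25L, 16L))

# roll = -Inf: if there is no exact Time match, use the next later row
# in the same grp (next observation carried backward)
lookups[, ID := events[lookups, ID, roll = -Inf, on = c("grp", "Time")]]

print(lookups)
# ID column: 101, 101, NA, NA
```

The third and fourth lookups get NA because no event in their group occurs at or after their Time.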

Here's a data.table-ish start. This only uses Track 4 (not 1 to 3) but it still appears to produce the requested output.

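M = as.data.table(MATCHINGS)
I = as.data.table(INVOLVED)
# Negate Time so that roll=TRUE (last on or before) acts as first on or after
M[,Time:=-Time]
I[,Time:=-Time]
setkey(M,Loc,Track4,Time)
# Rolling join on the key of M, then restore the sign of Time in I
I[,ID:={i=list(Loc,Track4,Time);M[i,ID,roll=TRUE,mult="first"]}][,Time:=-Time]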

    ID Track1 Track2 Track3 Track4 Time Loc
 1:  1     NA    105     NA     35    1   1
 2:  1     NA     NA     NA     35    2   1
 3:  1     26    105     NA     35    3   1
 4:  2     NA     NA     NA     40   20   1
 5:  2    134      1      6     40   20   1
 6:  3     13    109     NA     45   30   1
 7:  4     15     NA     NA     50   40   1
 8:  5     17     NA    109     55   50   1
 9: NA     17    432    109     55   65   1
10:  6     17    115    109     55   59   1
11:  7     13    195    150     60   68   1
12:  7     13    195    150     60   62   1
13: NA     10      5      1     10   61   3
14:  8     13    195    150     60   72   1

Interesting question! If this seems ok, please change the example data to need tracks 1 to 3. Or perhaps you can take it from here.
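One possible way to take it from here, offered as a rough sketch rather than a tested solution: repeat the rolling join once per track, most preferred first, and only fill rows that still have no ID. This assumes that the first track producing a match should win and that empty tracks are stored as NA. The tables below are a small hand-typed subset of the question's data (Track2 omitted for brevity), and the helper names todo and hit are made up for illustration.

```r
library(data.table)

# Hand-typed subset of the question's data; NA marks an empty track
M <- data.table(ID     = c(9L, 10L, 13L, 14L),
                Track1 = c(NA, NA, 118L, 120L),
                Track3 = c(181L, 182L, NA, 213L),
                Track4 = c(70L, 70L, 90L, 90L),
                Time   = c(90L, 92L, 100L, 101L),
                Loc    = c(2L, 2L, 4L, 4L))
I <- data.table(Track1 = c(40L, NA, NA, 118L),
                Track3 = c(181L, 182L, NA, 213L),
                Track4 = c(70L, 70L, 90L, 90L),
                Time   = c(82L, 85L, 95L, 96L),
                Loc    = c(2L, 2L, 4L, 4L))
I[, ID := NA_integer_]

# Most preferred track first; later tracks only fill rows still without an ID
for (trk in c("Track1", "Track3", "Track4")) {
  # Skip rows where this track is empty, so that NA is never joined to NA
  todo <- which(is.na(I$ID) & !is.na(I[[trk]]))
  if (length(todo) == 0L) next
  hit <- M[!is.na(M[[trk]])][I[todo], ID, roll = -Inf, mult = "first",
                             on = c("Loc", trk, "Time")]
  I[todo, ID := hit]
}

print(I)
# ID column: 9, 10, 13, 13
```

On this subset the result agrees with the IDs listed for the same rows in the question's expected output. Whether the same loop gives the intended answer for every row of the full data depends on how ties between tracks are meant to be resolved.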

edited Oct 11, 2015 at 13:40

Arun

answered Oct 4, 2012 at 11:29

Matt Dowle

12 Comments

mnel Over a year ago

I've never thought of using { within := -- that is pure scoping genius!

Max van der Heijden Over a year ago

Brilliant! This is so very, very fast! Indeed, Track1 to 3 are not needed in the example data. I will try to come up with an example where Track1, 2 and 3 are needed later. One question: why do you use the Time:=-Time calls? Is that for mult="first" to get the oldest instead of the newest record?

Matt Dowle Over a year ago

@MaxvanderHeijden Yes, sort of. Because you needed first-after rather than last-before, is perhaps the way I'd word it. roll=TRUE does last-on-or-before. Changing signs changes that to first-on-or-after. But that trick only works for integer and double, not character.

Max van der Heijden Over a year ago

I've updated the example to include Track1 to 3 and thus to take into account the importance of the 4 types of tracking. I've deleted the entries belonging to the first 3 orders to save some space.

Matt Dowle Over a year ago

@MaxvanderHeijden Good. What have you tried? Have you understood the proposed approach and can you extend it?
