data.table in R - multiple filters using multiple keys - binary search

Question

I don't understand how I can filter based on multiple keys in data.table. Take the built-in mtcars dataset.

DT <- data.table(mtcars)
setkey(DT, am, gear, carb)

Following the vignette, I know that if I want to have filtering that corresponds to am == 1 & gear == 4 & carb == 4, I can say

> DT[.(1, 4, 4)]
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1:  21   6  160 110  3.9 2.620 16.46  0  1    4    4
2:  21   6  160 110  3.9 2.875 17.02  0  1    4    4

and it gives the correct result. Furthermore, if I want to have am == 1 & gear == 4 & (carb == 4 | carb == 2), this also works

> DT[.(1, 4, c(4, 2))]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
2: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
3: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
4: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

However, when I want to have am == 1 & (gear == 3 | gear == 4) & (carb == 4 | carb == 2), the plausible

> DT[.(1, c(3, 4), c(4, 2))]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1:   NA  NA    NA  NA   NA    NA    NA NA  1    3    4
2: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
3: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

fails. Could you please explain to me what is the right approach here?

You'll want to use CJ, like DT[CJ(1,3:4,c(4,2))]. Your approach does not work because it searches only for the combinations 1,3,4 and 1,4,2.
– Frank, Commented Jul 11, 2015 at 0:13

Dean MacGregor · Accepted Answer · 2015-07-11 04:52:04Z

The reason you didn't get an error from your query is that data.table will reuse values when they're multiples of other values. In other words, because the 1 for am can be used 2 times, it does this without telling you. If you were to do a query where the number of allowable values weren't multiples of each other then it would give you a warning. For example

DT[.(c(1,0),c(5,4,3),c(8,6,4))]

will give you a warning complaining about a remainder of 1 item, the same warning you would see when typing data.table(c(1,0),c(5,4,3),c(8,6,4)). Whenever you merge with X[Y], think of both X and Y as data.tables.
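A minimal sketch of how one might check what the mismatched-length query does on a given installation. It assumes only the keyed mtcars table from the question; the variable name msg is illustrative. Depending on the installed data.table version, the recycling problem may surface as a warning or as an error, so the sketch captures both rather than assuming one.

```r
library(data.table)

DT <- data.table(mtcars)
setkey(DT, am, gear, carb)

# Lengths 2, 3 and 3 cannot be recycled evenly.
# Capture whichever condition (warning or error) this version raises.
msg <- tryCatch(
  {
    DT[.(c(1, 0), c(5, 4, 3), c(8, 6, 4))]
    "no condition raised"
  },
  warning = function(w) paste("warning:", conditionMessage(w)),
  error   = function(e) paste("error:", conditionMessage(e))
)
print(msg)
```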

If you instead use CJ,

DT[CJ(c(1,0),c(5,4,3),c(8,6,4))]

then it will make every combination of all the values for you and data.table will give the results you expect.

From the vignette (bolding is mine):

What’s happening here? Read this again. The value provided for the second key column “MIA” has to find the matching values in dest key column on the matching rows provided by the first key column origin. We can not skip the values of key columns before. Therefore we provide all unique values from key column origin. “MIA” is automatically recycled to fit the length of unique(origin) which is 3.
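Applied to the mtcars table from the question, the same idea might look like the sketch below: to filter on gear alone while the key starts with am, every unique value of am is supplied and the single gear value is recycled. The name gear4 is illustrative, and nomatch = 0 is added only to drop any combination that has no match.

```r
library(data.table)

DT <- data.table(mtcars)
setkey(DT, am, gear, carb)

# The first key column cannot be skipped, so pass all of its unique values;
# the length-1 value 4 is recycled to match them.
gear4 <- DT[.(unique(DT$am), 4), nomatch = 0]
print(gear4[, .(am, gear, carb, mpg)])
```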

Just for completeness, the vector scan syntax will work without using CJ

DT[am == 1 & gear == 4 & carb == 4]

or

DT[am == 1 & (gear == 3 | gear == 4) & (carb == 4 | carb == 2)]

How do you know if you need a binary search? If the speed of subsetting is unbearable, then you need a binary search. For example, I've got a 48M row data.table I'm playing with, and the difference between a binary search and a vector scan is staggering. Specifically, a vector scan takes 1.490 seconds of elapsed time but a binary search takes only 0.001 seconds. That, of course, assumes that I've already keyed the data.table. If I include the time it takes to set the key, then setting the key and performing the subset together take 1.628 seconds. So you have to pick your poison.
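A rough way to measure this trade-off on your own data is sketched below. The table big, its columns a, b and x, and the size n are all made up for illustration; they are not the 48M row table mentioned above, and the timings will differ from machine to machine.

```r
library(data.table)

set.seed(1)
n <- 1e6  # hypothetical size, chosen only so the sketch runs quickly
big <- data.table(a = sample(0:1, n, TRUE),
                  b = sample(3:5, n, TRUE),
                  x = runif(n))

# Vector scan on the unkeyed table
t_scan <- system.time(r1 <- big[a == 1L & b == 4L])

# One-off cost of keying a copy, then a binary search on the key
keyed <- copy(big)
t_key <- system.time(setkey(keyed, a, b))
t_bin <- system.time(r2 <- keyed[.(1L, 4L), nomatch = 0])

print(rbind(scan = t_scan, setkey = t_key, binary = t_bin)[, "elapsed"])
```

Both subsets select the same rows; only the cost of getting there differs.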

edited Jul 11, 2015 at 4:52

answered Jul 11, 2015 at 3:39

3 Comments

Dean MacGregor Over a year ago

@Frank My "Firstly" point wasn't meant to be accusatory or patronizing in such a way that it'd be deemed fair or not fair. In skimming through the vignette I could see how someone could miss the vector scan syntax and think they had to use the binary search syntax. It seems the target audience of this question will be people just learning data.table for the first time so I don't think it is "unfair" to reiterate that there is a more verbose syntax available. At the end of the day people can ignore the part of the answer that doesn't apply to them but can't read what isn't there.

Arun Over a year ago

Dean, the OP directly linked to that vignette. The base R-like syntax was already covered in the previous vignettes (this is the 3rd in the series of vignettes). But no worries. You are right about knowing the trade-off between setkey + binary search and vector scan. It is also worth mentioning that keying is particularly advantageous if you have more than one subset to perform. Also worth mentioning is auto indexing, which uses base R syntax as is and optimises internally, although it does not optimise this syntax yet. We will try and expand it in the future. Great answer!

Dean MacGregor Over a year ago

I reordered the paragraphs of the answer and removed the "firstly" "secondly" wording so as to give emphasis to the actual answer rather than the comparison to vector scan.

Uwe · Accepted Answer · 2017-02-26 00:13:04Z

This question has become the target of a duplicate question, and I felt that the existing answers could be improved to help novice data.table users.

1. What is the difference between `DT[.()]` and `DT[CJ()]`?

According to ?data.table, .() is an alias for list() and a list supplied as parameter i is converted into a data.table internally. So, DT[.(1, c(3, 4), c(2, 4))] is equivalent to DT[data.table(1, c(3, 4), c(2, 4))] with

data.table(1, c(3, 4), c(2, 4))
#   V1 V2 V3
#1:  1  3  2
#2:  1  4  4

The data.table consists of two rows which is the length of the longest vector. 1 is recycled.

This is different from a cross join, which creates all combinations of the supplied vectors.

CJ(1, c(3, 4), c(2, 4))
#   V1 V2 V3
#1:  1  3  2
#2:  1  3  4
#3:  1  4  2
#4:  1  4  4

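Note that setDT(expand.grid()) would produce the same set of combinations, although with different column names and row order, and without the key that CJ() sets.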
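A small sketch comparing the two, assuming nothing beyond base R and data.table. The names cj and eg are illustrative. expand.grid() varies its first argument fastest, so its rows come out in a different order than those of CJ(), which sorts its result; after reordering, the contents agree.

```r
library(data.table)

cj <- CJ(1, c(3, 4), c(2, 4))

# Name the columns V1..V3 so they line up with CJ's default names
eg <- as.data.table(expand.grid(V1 = 1, V2 = c(3, 4), V3 = c(2, 4),
                                KEEP.OUT.ATTRS = FALSE))
print(eg)                 # same combinations, different row order

setorder(eg, V1, V2, V3)  # sort the way CJ() does
print(all(as.matrix(cj) == as.matrix(eg)))
```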

This explains why the OP gets two different results:

DT[.(1, c(3, 4), c(2, 4))]
#   mpg cyl disp  hp drat    wt  qsec vs am gear carb
#1:  NA  NA   NA  NA   NA    NA    NA NA  1    3    2
#2:  21   6  160 110  3.9 2.620 16.46  0  1    4    4
#3:  21   6  160 110  3.9 2.875 17.02  0  1    4    4

DT[CJ(1, c(3, 4), c(2, 4))]
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#1:   NA  NA    NA  NA   NA    NA    NA NA  1    3    2
#2:   NA  NA    NA  NA   NA    NA    NA NA  1    3    4
#3: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#4: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#5: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#6: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4

Note that the parameter nomatch = 0 will remove the non-matching rows, i.e., the rows containing NA.
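For illustration, here is the same cross join with and without nomatch = 0 on the keyed mtcars table. The names with_na and matched are illustrative only.

```r
library(data.table)

DT <- data.table(mtcars)
setkey(DT, am, gear, carb)

# Default: combinations without a match are returned as rows of NA
with_na <- DT[CJ(1, c(3, 4), c(2, 4))]

# nomatch = 0: combinations without a match are dropped
matched <- DT[CJ(1, c(3, 4), c(2, 4)), nomatch = 0]

print(nrow(with_na))
print(nrow(matched))
```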

2. Using `%in%`

Besides CJ() and am == 1 & (gear == 3 | gear == 4) & (carb == 2 | carb == 4), there is a third equivalent option using value matching:

DT[am == 1 & gear %in% c(3, 4) & carb %in% c(2, 4)]
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#1: 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
#2: 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
#3: 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#4: 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4

Note that the CJ() join as written requires the data.table to be keyed, while the other two variants also work with unkeyed data.tables.
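If keying is not wanted, reasonably recent data.table versions also accept an on= argument that names the join columns for a single query. The sketch below assumes such a version; DT_plain, by_join and by_in are illustrative names. Naming the CJ() arguments gives the columns of i the same names as the columns being matched.

```r
library(data.table)

DT_plain <- data.table(mtcars)  # deliberately left unkeyed

# Ad hoc join on the named columns, without calling setkey()
by_join <- DT_plain[CJ(am = 1, gear = c(3, 4), carb = c(2, 4)),
                    on = c("am", "gear", "carb"), nomatch = 0]

# The %in% variant for comparison
by_in <- DT_plain[am == 1 & gear %in% c(3, 4) & carb %in% c(2, 4)]

print(by_join)
```

The two results contain the same rows, though not necessarily in the same order, because the join returns rows in the order of the CJ() combinations.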

3. Benchmarking

Data

In order to test execution speed of the 3 options we need a much larger data.table than just the 32 rows of mtcars. This is achieved by repeatedly doubling mtcars until 1 million rows (89 MB) are reached. Then this data.table is copied to get a keyed version of the same input data.

library(data.table)
# create unkeyed data.table
DT_unkey <- data.table(mtcars)
for (i in 1:15) {
  DT_unkey <- rbindlist(list(DT_unkey, DT_unkey))
  print(nrow(DT_unkey))
}

#create keyed data.table
DT_keyed <- copy(DT_unkey)
setkeyv(DT_keyed, c("am", "gear", "carb"))

# show data.tables
tables()
#     NAME          NROW NCOL MB COLS                                         KEY         
#[1,] DT_keyed 1,048,576   11 89 mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb am,gear,carb
#[2,] DT_unkey 1,048,576   11 89 mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb             
#Total: 178MB

Run

To get a fair comparison, the setkey() operations are included in the timing. Also, the data.tables are explicitly copied to exclude effects from data.table's update by reference.

With

# Assumption: DT is the unkeyed 1M-row table built above
DT <- copy(DT_unkey)

result <- microbenchmark::microbenchmark(
  # cost of copying and keying alone
  setkey = {
    DT_keyed <- copy(DT)
    setkeyv(DT_keyed, c("am", "gear", "carb"))},
  # binary search via cross join on the keyed table
  cj_keyed = {
    DT_keyed <- copy(DT)
    setkeyv(DT_keyed, c("am", "gear", "carb"))
    DT_keyed[CJ(1, c(3, 4), c(2, 4)), nomatch = 0]},
  # vector scan with == and |, keyed and unkeyed
  or_keyed = {
    DT_keyed <- copy(DT)
    setkeyv(DT_keyed, c("am", "gear", "carb"))
    DT_keyed[am == 1 & (gear == 3 | gear == 4) & (carb == 2 | carb == 4)]},
  or_unkey = {
    DT_unkey <- copy(DT)
    DT_unkey[am == 1 & (gear == 3 | gear == 4) & (carb == 2 | carb == 4)]},
  # vector scan with %in%, keyed and unkeyed
  in_keyed = {
    DT_keyed <- copy(DT)
    setkeyv(DT_keyed, c("am", "gear", "carb"))
    DT_keyed[am %in% c(1) & gear %in% c(3, 4) & carb %in% c(2, 4)]},
  in_unkey = {
    DT_unkey <- copy(DT)
    DT_unkey[am %in% c(1) & gear %in% c(3, 4) & carb %in% c(2, 4)]},
  times = 10L)

we get

print(result)
#Unit: milliseconds
#     expr       min        lq     mean    median       uq      max neval
#   setkey 198.23972 198.80760 209.0392 203.47035 213.7455 245.8931    10
# cj_keyed 210.03574 212.46850 227.6808 216.00190 254.0678 259.5231    10
# or_keyed 244.47532 251.45227 296.7229 287.66158 291.3811 404.8678    10
# or_unkey  69.78046  75.61220 103.6113  89.32464 111.5240 231.6814    10
# in_keyed 269.82501 270.81692 302.3453 274.42716 321.2935 431.9619    10
# in_unkey  93.75537  95.86832 119.4371 100.19446 126.6605 251.4172    10

ggplot2::autoplot(result)

Apparently, setkey() is a rather costly operation. So, for a one-time task, the vector scan operations might be faster than using binary search on a keyed table.

The benchmark was run with R version 3.3.2 (x86_64, mingw32), data.table 1.10.4, microbenchmark 1.4-2.1.
