
I have to subset a sequence of data.frames frequently (millions of times per run). The data.frames have an approximate size of 200 rows x 30 columns. Depending on the state, the values in the data.frame change from one iteration to the next, so doing a single subset at the beginning does not work.

Unlike the question of when a data.table starts to be faster than a data.frame, I am looking for a speed-up of subsetting at a fixed size of the data.frame/data.table.

The following minimal reproducible example shows that data.frame seems to be the fastest:

library(data.table)
nmax <- 1e2 # for 1e7 the results look as expected: data.table is really fast!
set.seed(1)
x <- runif(nmax, min = 0, max = 10)
y <- runif(nmax, min = 0, max = 10)
DF <- data.frame(x, y)
DT <- data.table(x, y)

summary(microbenchmark::microbenchmark(
  setkey(DT, x, y),
  times = 10L, unit = "us"))

#               expr    min     lq    mean  median      uq     max neval
# 1 setkey(DT, x, y) 70.326 72.606 105.032 80.3985 126.586 212.877    10

summary(microbenchmark::microbenchmark(
  DF[DF$x > 5, ],
  `[.data.frame`(DT, DT$x < 5, ),
  DT[x > 5],
  times = 100L, unit = "us"))
#                             expr     min      lq      mean   median       uq     max neval
# 1                 DF[DF$x > 5, ]  41.815  45.426  52.40197  49.9885  57.4010  82.110   100
# 2 `[.data.frame`(DT, DT$x < 5, )  43.716  47.707  58.06979  53.5995  61.2020 147.873   100
# 3                      DT[x > 5] 205.273 214.777 233.09221 222.0000 231.6935 900.164   100

Is there anything I can do to improve performance?

Edit after input:

  • I am running a discrete event simulation, and for each event I have to search in a list (I don't mind whether it is a data.frame or a data.table). Most likely I could implement a different approach, but then I would have to rewrite code that has been developed over more than three years. At the moment this is not an option, but if there is no way to make it faster, it might become one in the future.
  • Technically it is not a sequence of data.frames but just one data.frame that changes with each iteration. However, this has no impact on "how to get the subset faster", and I hope the question is now more comprehensible.
  • Unless you are subsetting a virtually random selection of the data, maybe you want to create a list of existing subsets that you can refer to instead of refiltering the data.frame in the same ways over and over (see the sketch after this list)? Commented Apr 10, 2017 at 12:30
  • There is overhead involved in data.table subsetting. See also stackoverflow.com/a/20179189/1412059 Commented Apr 10, 2017 at 13:07
  • You should ask a proper question about your actual problem. If you are subsetting data.frames millions of times, your approach is wrong. Commented Apr 10, 2017 at 13:09
  • Please reopen the question. I can't find an answer anywhere, and I think the question should be clear after the edits. Commented Apr 11, 2017 at 15:59
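
A minimal sketch of the caching idea from the first comment, assuming the same few thresholds are queried repeatedly between changes to the data; the threshold values and the subset_cache name are hypothetical, purely for illustration:

set.seed(1)
DF <- data.frame(x = runif(200, min = 0, max = 10),
                 y = runif(200, min = 0, max = 10))

thresholds <- c(2.5, 5, 7.5)                    # hypothetical recurring query thresholds
subset_cache <- lapply(thresholds, function(th) DF[DF$x > th, ])
names(subset_cache) <- as.character(thresholds)

# later lookups reuse the cached result instead of refiltering DF
hits <- subset_cache[["5"]]

# the cache must be rebuilt whenever DF changes
DF$x <- runif(200, min = 0, max = 10)
subset_cache <- lapply(thresholds, function(th) DF[DF$x > th, ])
names(subset_cache) <- as.character(thresholds)

This only pays off if the data changes less often than it is queried; if the data.frame changes before every query, the cache is rebuilt each time and nothing is saved.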

1 Answer


You will see a performance boost by converting to matrices. This is a viable alternative if the whole content of your data.frame is numerical (or can be converted without too much trouble).

Here we go. First, I modified the data so that it has size 200 x 30:

library(data.table)
nmax <- 200
cmax <- 30
set.seed(1)
x <- runif(nmax, min = 0, max = 10)
DF <- data.frame(x)
for (i in 2:cmax) {
  DF <- cbind(DF, runif(nmax, min = 0, max = 10))
  colnames(DF)[ncol(DF)] <- paste0('x', i)
}
DT <- data.table(DF)
DM <- as.matrix(DF)    # or data.matrix(DF) if you have factors

And the comparison, ranked from quickest to slowest:

summary(microbenchmark::microbenchmark(
  DM[DM[, 'x'] > 5, ],             # quickest
  as.matrix(DF)[DF$x > 5, ],       # still quicker even with the conversion
  DF[DF$x > 5, ],
  `[.data.frame`(DT, DT$x < 5, ),
  DT[x > 5],
  times = 100L, unit = "us"))

#                             expr     min       lq      mean   median       uq      max neval
# 1            DM[DM[, "x"] > 5, ]  13.883  19.8700  22.65164  22.4600  24.9100   41.107   100
# 2      as.matrix(DF)[DF$x > 5, ] 141.100 181.9140 196.02329 195.7040 210.2795  304.989   100
# 3                 DF[DF$x > 5, ] 198.846 238.8085 260.07793 255.6265 278.4080  377.982   100
# 4 `[.data.frame`(DT, DT$x < 5, ) 212.342 268.2945 346.87836 289.5885 304.2525 5894.712   100
# 5                      DT[x > 5] 322.695 396.3675 465.19192 428.6370 457.9100 4186.487   100

If your use case involves querying the data multiple times, you can do the conversion only once and increase the speed by an order of magnitude.
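
A minimal sketch of that convert-once pattern, reusing the DM matrix built above; the loop count and the changing threshold are made-up stand-ins for whatever each simulation event produces:

DM <- as.matrix(DF)                      # one-time conversion
for (i in 1:1e4) {
  th <- runif(1, min = 0, max = 10)      # stand-in for an event-specific threshold
  hits <- DM[DM[, 'x'] > th, , drop = FALSE]
  # ... process hits for this event ...
}

The drop = FALSE keeps the result a matrix even when only one row matches. If the values themselves change between events, the conversion has to be repeated, and the gain shrinks accordingly.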


1 Comment

Cool, I already gave up, but it seems there is always somebody with a good idea :-)
