I have to subset a sequence of data.frames very frequently (millions of times per run). Each data.frame has roughly 200 rows and 30 columns. The values in the data.frame change from one iteration to the next depending on the state, so subsetting once at the beginning does not work.
In contrast to the question of when a data.table starts to be faster than a data.frame, I am looking for a way to speed up subsetting for a given size of the data.frame/data.table.
The following minimal reproducible example shows that data.frame seems to be the fastest:
library(data.table)
nmax <- 1e2  # for 1e7 the results look as expected: data.table is really fast!
set.seed(1)
x <- runif(nmax, min = 0, max = 10)
y <- runif(nmax, min = 0, max = 10)
DF <- data.frame(x, y)
DT <- data.table(x, y)
summary(microbenchmark::microbenchmark(
  setkey(DT, x, y),
  times = 10L, unit = "us"))
#               expr    min     lq    mean  median      uq     max neval
# 1 setkey(DT, x, y) 70.326 72.606 105.032 80.3985 126.586 212.877    10
summary(microbenchmark::microbenchmark(
  DF[DF$x > 5, ],
  `[.data.frame`(DT, DT$x < 5, ),  # filters x < 5, unlike the other two (x > 5)
  DT[x > 5],
  times = 100L, unit = "us"))
#                             expr     min      lq      mean   median       uq     max neval
# 1                 DF[DF$x > 5, ]  41.815  45.426  52.40197  49.9885  57.4010  82.110   100
# 2 `[.data.frame`(DT, DT$x < 5, )  43.716  47.707  58.06979  53.5995  61.2020 147.873   100
# 3                      DT[x > 5] 205.273 214.777 233.09221 222.0000 231.6935 900.164   100
Is there anything I can do to improve performance?
Edit after input:
- I am running a discrete event simulation, and for each event I have to search in a list (I don't mind whether it is a data.frame or a data.table). Most likely I could implement a different approach, but then I would have to rewrite code that was developed over more than 3 years. At the moment, this is not an option, but if there is no way to make it faster, it might become one in the future.
- Technically, it is not a sequence of data.frames but just one data.frame that changes with each iteration. However, this has no impact on "how to get the subset faster", and I hope the question is now clearer.
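
To make the use case more concrete, the loop could be sketched roughly as below. This is only an illustration, not the actual simulation code: the names `state`, `n_events` and `n_selected` are hypothetical, the update rule (overwriting one random value per event) is an assumption, and the table has 2 columns as in the example above rather than 30.

```r
# Hypothetical sketch of the use case: one small data.frame that is
# modified on every event and then has to be subset again.
set.seed(1)
n_rows   <- 200L   # approximate number of rows from the question
n_events <- 1000L  # placeholder; real runs perform millions of subsets

state <- data.frame(x = runif(n_rows, min = 0, max = 10),
                    y = runif(n_rows, min = 0, max = 10))

n_selected <- integer(n_events)
for (i in seq_len(n_events)) {
  # Assumed update rule: each event overwrites one value, so a subset
  # computed in an earlier iteration is no longer valid.
  row <- sample.int(n_rows, 1L)
  state$x[row] <- runif(1L, min = 0, max = 10)

  # The subset has to be recomputed against the current state.
  selected <- state[state$x > 5, ]
  n_selected[i] <- nrow(selected)
}
```

Because the subset is recomputed for every event, the fixed per-call cost of the subsetting operation matters more here than how it scales with the number of rows.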