
I have a data frame called "marketdata" which contains 3,000,000 rows (rownames: 1 to 3,000,000) and 2 columns (colnames: "mid", "bo").

> head(marketdata)
    mid    bo  
1   250    0.05
2   251    0.07
3   252    0.13
4   249    0.08
5   250    0.12

My function is as follows:

movingWindow <- function(submarketdata) {
   # rows of the window with "bo" <= 0.1 form the "effective sample"
   temp <- submarketdata[submarketdata$bo <= 0.1, ]
   # mean "mid" of the effective sample and its share of the 100-row window
   return( c(mean(temp$mid), NROW(temp)/100) )
}

result <- lapply(101:NROW(marketdata), function(i) movingWindow( marketdata[ (i-99):i , ] ))

For example, for row 101 the function is applied to marketdata[2:101, ]. Within that 100-row window, the rows with a "bo" value <= 0.1 form the "effective sample"; I then calculate the average "mid" of those rows and their share of the window.

However, this script runs really slowly: it took about 15 minutes to process all 3,000,000 rows. Could anyone help me speed this up? Thank you.

  • You need to provide a representative sample of marketdata. Please read this FAQ. Commented Dec 12, 2013 at 19:49
  • Well, you are achieving approx. 3333 outputs per second. There's no need for your movingWindow function, as each of its outputs is easily vectorized. Commented Dec 12, 2013 at 20:21
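
As the second comment suggests, both outputs of movingWindow can be computed without any per-window function call. Below is a minimal sketch of that vectorization using base-R cumulative sums; the variable names (eff, cs_n, cs_m, ...) are purely illustrative and not from the original post.

#sketch only: vectorized version of the rolling statistics via cumulative sums
eff  <- marketdata$bo <= 0.1                 # "effective" rows
cs_n <- cumsum(eff)                          # running count of effective rows
cs_m <- cumsum(eff * marketdata$mid)         # running sum of their "mid" values

i   <- 101:NROW(marketdata)                  # windows marketdata[(i-99):i, ]
n   <- cs_n[i] - cs_n[i - 100]               # effective rows in each window
m   <- (cs_m[i] - cs_m[i - 100]) / n         # mean "mid" over effective rows
pct <- n / 100                               # share of the 100-row window

result_vec <- cbind(m, pct)
#note: cumulative sums over millions of rows can accumulate some floating-point error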

1 Answer

#small reproducible example
set.seed(42)
marketdata <- data.frame(mid=runif(200, 245, 255),
                         bo=runif(200, 0, 0.2))

#original approach, kept for comparison
movingWindow <- function(submarketdata) {
  temp <- submarketdata[submarketdata$bo <= 0.1, ]
  return( c(mean(temp$mid), NROW(temp)/100) )
}

result <- t(sapply(101:NROW(marketdata), function(i) movingWindow( marketdata[ (i-99):i , ] )))

#faster alternative:
library(zoo)
#share of rows with bo <= 0.1 in each 100-row window
r1 <- rollmean(marketdata$bo <= 0.1, 100)
all.equal(r1[-1], result[,2])

#rolling sum of "mid" over those rows, divided by their count (100*r1)
r2 <- rollsum((marketdata$bo <= 0.1)*marketdata$mid, 100)/(100*r1)

result2 <- cbind(r2, r1)

#same result?
all.equal(result, unname(result2[-1,]))
#[1] TRUE

#base R alternative (assuming there are no NA values in your data)
#stats::filter with constant weights gives a rolling mean/sum
r1a <- na.omit(filter(marketdata$bo <= 0.1, rep(0.01, 100)))
r2a <- na.omit(filter((marketdata$bo <= 0.1)*marketdata$mid, rep(1, 100)))/(100*r1a)
result2a <- cbind(r2a, r1a)

#same result?
all.equal(result, unname(result2a[-1,]))
#[1] TRUE

Both alternatives return one extra value (the first one); otherwise the results are identical, and both are much faster. Note that 100*r1 (respectively 100*r1a) is simply the number of effective rows in each window, so dividing the rolling sum of (bo <= 0.1)*mid by it gives the average "mid" of the effective sample.
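
As an editor's addition (not part of the original answer), the same two rolling quantities can also be written with data.table's frollsum, assuming data.table >= 1.12.0 is available:

#editor's sketch, not from the original answer
library(data.table)

eff <- as.numeric(marketdata$bo <= 0.1)
cnt <- frollsum(eff, 100)                         #effective rows per 100-row window
avg <- frollsum(eff * marketdata$mid, 100) / cnt  #mean "mid" over those rows

#positions 101:N correspond to the windows used in the question (2:101, 3:102, ...)
result3 <- cbind(avg, cnt/100)[101:NROW(marketdata), ]
all.equal(result, unname(result3))                #should agree up to numerical tolerance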

Benchmarks for the example:

Unit: microseconds
        expr        min        lq    median        uq       max neval
    original  19006.144 19435.262 20824.245 21243.524 52965.168   100
alternative1   1444.574  1525.774  1607.264  1646.524  3486.940   100
alternative2    975.366  1006.913  1071.305  1106.437  3117.709   100
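
The table above is in the output format of the microbenchmark package. A sketch of how such a comparison could have been run on the 200-row example (the expression labels are assumed to match the ones in the table):

#sketch only: timing the three approaches with microbenchmark
library(microbenchmark)
library(zoo)

microbenchmark(
  original = t(sapply(101:NROW(marketdata),
                      function(i) movingWindow(marketdata[(i-99):i, ]))),
  alternative1 = {
    r1 <- rollmean(marketdata$bo <= 0.1, 100)
    cbind(rollsum((marketdata$bo <= 0.1)*marketdata$mid, 100)/(100*r1), r1)
  },
  alternative2 = {
    r1a <- na.omit(filter(marketdata$bo <= 0.1, rep(0.01, 100)))
    cbind(na.omit(filter((marketdata$bo <= 0.1)*marketdata$mid, rep(1, 100)))/(100*r1a), r1a)
  },
  times = 100
)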

3 Comments

Shouldn't your Alt#2 r2a equation call r1a, not r1?
Hi Roland, thank you very much for your post. However, I think this alternative gives a different result from my original method: in my function I calculate the average of samples with "bo" <= 0.1 WITHIN the previous 100 updates, whereas yours seems to calculate it over the previous 100 samples that satisfy "bo" <= 0.1.
Oh, I see. I did make a mistake. The average is actually calculated as the "rollsum" divided by (100*r1), rather than by the first rollmean call, which returns the fraction of effective samples. Thank you very much.
