R:how to rewrite my coding to work in high efficiency?

Question

I have a matrix (named rating) with dim n x 140000 and another matrix (named trust) with dim nxn where n varying when I change the group and n might have value from 1-15000. I need to multiply each column of rating by trust. for example:

trust=                         rating=
a1 a2 a3 a4 a5                 1 2 3 4 5 6 7 8
b1 b2 b3 b4 b5                 2 5 7 8 9 2 1 6
c1 c2 c3 c4 c5                 3 5 3 6 8 1 2 5 
d1 d2 d3 d4 d5                 4 7 8 2 4 5 6 7
e1 e2 e3 e4 e5                 5 2 5 7 8 9 1 4

answer1=                       answer2=              
a1.1 a2.2 a3.3 a4.4 a5.5       a1.2 a2.5 a3.5 a4.7 a5.2 
b1.1 b2.2 b3.3 b4.4 b5.5       b1.2 b2.5 b3.5 b4.7 b5.2
c1.1 c2.2 c3.3 c4.4 c5.5       c1.2 c2.5 c3.5 c4.7 c5.2
d1.1 d2.2 d3.3 d4.4 d5.5       d1.2 d2.5 d3.5 d4.7 d5.2
e1.1 e2.2 e3.3 e4.4 e5.5       e1.2 e2.5 e3.5 e4.7 e5.2

and answer3 must multiply by 3rd column and so on. Then add each rows of answer1, answer2, ... and store into a vector. Then store each vector into a list for future use.

 for (k in 1:ncol(rating)) {
   clmy <- as.matrix(rating[, k])
   answer <- sweep(trust, MARGIN = 2, clmy, '*')
   sumtrustbyrating <- rowSums(answer)
   LstsumRbyT[[k]] <- sumtrustbyrating
   sumtrustbyrating = NULL
 }

It is working perfectly if I change the ncol(rating) to a small value (about 100). But for the actual data, I have 140000 columns. It takes time and I couldn't get the final execution result. Please help me to enhance the performance of my code for a huge data set.

I do not understand why you have twice the row e1.1 e2.2 e3.3 e4.4 e5.5 in the end of answer1 — amonk
– amonk, Commented May 22, 2017 at 20:36

ekstroem · Accepted Answer · 2017-05-23 05:59:47Z

How about a matrix product? Or is that too slow?

rating <- matrix(c(1, 2, 3, 4, 5,2, 5, 5, 6, 3, 3, 4, 1, 2, 1), ncol=3)
trust <- matrix(rep(1:5, rep(5, 1)), 5, byrow=TRUE)

Running your code above yields

LstsumRbyT
[[1]]
[1] 55 55 55 55 55

[[2]]
[1] 66 66 66 66 66

[[3]]
[1] 27 27 27 27 27

which is the same as

 trust %*% rating
     [,1] [,2] [,3]
[1,]   55   66   27
[2,]   55   66   27
[3,]   55   66   27
[4,]   55   66   27
[5,]   55   66   27

If this isn't enough then this could be improved a bit in RCppArmadillo I guess.

To add to the benchmarking discussion. If your for loop above is renamed f() then I get

microbenchmark(trust %*% rating, f())
Unit: microseconds
             expr     min       lq      mean   median       uq      max neval cld
 trust %*% rating   1.418   1.7010   2.97663   2.7215   3.5965   14.452   100  a 
              f() 593.890 700.9775 764.00515 766.5535 792.6375 1511.104   100   b

which is quite a substantial speedup with the normal matrix product.

amonk · Accepted Answer · 2017-05-23 00:07:31Z

I would vectorize everything:

library(data.table)
set.seed(666)#in order to have reproducible results
n<-10#number of cols and rows
(trust<-matrix(runif(n*n),ncol=n,nrow=n))

          [,1]       [,2]       [,3]      [,4]      [,5]       [,6]      [,7]       [,8]      [,9]      [,10]
  [1,] 0.77436849 0.77589308 0.98422408 0.4697785 0.2444375 0.06913359 0.7748744 0.60379428 0.7659585 0.13247078
  [2,] 0.19722419 0.01637905 0.60134555 0.3976166 0.5309707 0.08462063 0.8120639 0.32826395 0.7758464 0.07851311
  [3,] 0.97801384 0.09574478 0.03834435 0.8046367 0.1183959 0.12994557 0.2606025 0.66611781 0.3125150 0.37822385
  [4,] 0.20132735 0.14216354 0.14149569 0.5088974 0.9833834 0.74613202 0.6515950 0.87478750 0.8422173 0.57962476
  [5,] 0.36124443 0.21112624 0.80638553 0.6349154 0.8977528 0.03887918 0.9238039 0.06887527 0.3141499 0.53642512
 [6,] 0.74261194 0.81125644 0.26668568 0.4942517 0.7385738 0.68563542 0.2661061 0.79346301 0.7565639 0.10853192
 [7,] 0.97872844 0.03654720 0.04270205 0.2801309 0.3773107 0.14397736 0.2661330 0.57142701 0.9675244 0.74031515
 [8,] 0.49811371 0.89163741 0.61217452 0.9087104 0.6061688 0.89107996 0.9109179 0.04894407 0.1694229 0.45178964
 [9,] 0.01331584 0.48323641 0.55334840 0.7841162 0.5121943 0.08963612 0.5905635 0.98035135 0.6968752 0.64610821
[10,] 0.25994613 0.46666453 0.85350077 0.5589970 0.9892467 0.03773272 0.9181476 0.91453735 0.8726508 0.74929873

(rating<-matrix(sample(n*n),ncol=n,nrow=n))

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   58   19   13   25   23   96   38  100   47    93
 [2,]   37   22   45   41    4   18   52   83   89    39
 [3,]   87   36   15   40   94   11   31   63   35    10
 [4,]   59   88   81   64   68   27   92   56   49    46
 [5,]   24   90    8   44   43   82   14   57   79    66
 [6,]   95   74   48   70    7   33   34   42   60    50
 [7,]   26   65   73   61   32   12   97   98    9    69
 [8,]   21   86    1   99    6   72   75   20   71    62
 [9,]   29   85   55   30   53   80   77    2   28    51
[10,]   67   91   76   16    5    3   84   54   78    17

A function:

  prod1<-function(m1,m2){
   res<-NULL
   if(dim(m1)[1]==dim(m2)[1])
     res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(1:nrow(m2)),function(y) {lapply(seq_along(1:nrow(m1)[1]),function(x){m1[,x]*m2[y,x]})})))$V1))
    return(res)
}

will produce: (answer1<-prod1(trust,rating))#sequence of arguments DOES matter

          V1         V2         V3        V4        V5        V6        V7        V8        V9       V10
 1: 44.9133724 14.7419685 12.7949130 11.744463  5.622062  6.636824 29.445226 60.379428 36.000049 12.319782
 2: 11.4390031  0.3112020  7.8174921  9.940414 12.212325  8.123580 30.858427 32.826395 36.464780  7.301719
 3: 56.7248030  1.8191509  0.4984765 20.115918  2.723107 12.474775  9.902897 66.611781 14.688207 35.174818
 4: 11.6769863  2.7011073  1.8394440 12.722435 22.617819 71.628674 24.760610 87.478750 39.584213 53.905103
 5: 20.9521768  4.0113985 10.4830118 15.872884 20.648315  3.732401 35.104546  6.887527 14.765046 49.887537
 6: 43.0714926 15.4138724  3.4669138 12.356293 16.987197 65.821000 10.112033 79.346301 35.558503 10.093469
 7: 56.7662495  0.6943967  0.5551267  7.003272  8.678146 13.821827 10.113054 57.142701 45.473646 68.849309
 8: 28.8905951 16.9411108  7.9582688 22.717759 13.941883 85.543676 34.614880  4.894407  7.962877 42.016436
 9:  0.7723185  9.1814918  7.1935292 19.602904 11.780468  8.605067 22.441414 98.035135 32.753133 60.088064
10: 15.0768755  8.8666260 11.0955099 13.974926 22.752673  3.622341 34.889611 91.453735 41.014587 69.684782

Finally the answer2 is given via the function

prod2<-function(m1,m2){
  res<-NULL
  if(dim(m1)[1]==dim(m2)[1])
    res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(2:nrow(m2)),function(y) {lapply(seq_along(2:nrow(m1)[1]),function(x){m1[,x]*m2[y,x+1]})})))$V1))
  return(res)
}

and in particular answer2<-prod2(trust,rating), yielding:

        V1         V2         V3        V4       V5        V6       V7        V8       V9
 1: 14.7130013 10.0866100 24.6056020 10.804906 23.46600  2.627076 77.48744 28.378331 71.23414
 2:  3.7472596  0.2129277 15.0336387  9.145181 50.97318  3.215584 81.20639 15.428406 72.15371
 3: 18.5822630  1.2446822  0.9586087 18.506645 11.36601  4.937932 26.06025 31.307537 29.06390
 4:  3.8252197  1.8481260  3.5373923 11.704640 94.40481 28.353017 65.15950 41.115012 78.32621
 5:  6.8636441  2.7446411 20.1596381 14.603053 86.18427  1.477409 92.38039  3.237138 29.21594
 6: 14.1096269 10.5463338  6.6671419 11.367790 70.90308 26.054146 26.61061 37.292761 70.36044
 7: 18.5958403  0.4751135  1.0675513  6.443011 36.22183  5.471140 26.61330 26.857069 89.97977
 8:  9.4641605 11.5912864 15.3043631 20.900338 58.19221 33.861038 91.09179  2.300371 15.75633
 9:  0.2530009  6.2820733 13.8337100 18.034672 49.17065  3.406172 59.05635 46.076514 64.80939
10:  4.9389764  6.0666389 21.3375191 12.856932 94.96768  1.433843 91.81476 42.983255 81.15652

Benchmarking

library(microbenchmark)
library("ggplot2")
set.seed(666)
global_func<-function(n){
  trust<-matrix(runif(n*n),ncol=n,nrow=n)
  rating<-matrix(sample(n*n),ncol=n,nrow=n)
  prod1<-function(m1,m2){
    res<-NULL
    if(dim(m1)[1]==dim(m2)[1])
     res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(1:nrow(m2)),function(y) {lapply(seq_along(1:nrow(m1)[1]),function(x){m1[,x]*m2[y,x]})})))$V1))
   return(res)
 }

 prod2<-function(m1,m2){
   res<-NULL
   if(dim(m1)[1]==dim(m2)[1])
     res<-rbindlist(data.table(rbindlist(data.table(lapply(seq_along(2:nrow(m2)),function(y) {lapply(seq_along(2:nrow(m1)[1]),function(x){m1[,x]*m2[y,x+1]})})))$V1))
   return(res)
   }
   return(list(prod1(trust,rating),prod2(trust,rating)))
}

Let's compare times vs number of cols/rows (n)---Use with caution

tm<-microbenchmark(global_func(10),
              global_func(50),
              global_func(100),
              global_func(500),
              times = 100
              )
 autoplot(tm)

I must thanks that you try to help me. But as I explained I need to multiply one column of rating by whole trust matrix and second column of rating by whole trust matrix and so on. I will do this for whole columns- 140000 times. That's why it takes time. otherwise my code is working perfectly.

Collectives™ on Stack Overflow

R:how to rewrite my coding to work in high efficiency?

2 Answers 2

Comments

Benchmarking

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Benchmarking

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related