I need help speeding up the calculation of cosine similarity scores between vectors in a term-document matrix. I have a matrix of strings and need to get the word similarity scores between the strings in each row.
I am using the 'tm' package to create a term-document matrix for each row of a data frame of text strings, and the 'lsa' package to get the cosine similarity score between the two resulting word vectors. I'm also using apply() to run the function below over an entire data frame:
similarity_score <- function(x) {
  # build a corpus from the two strings in this row
  x <- VectorSource(x)
  x <- Corpus(x)
  # standard tm preprocessing
  x <- tm_map(x, tolower)
  x <- tm_map(x, removePunctuation)
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, removeWords, stopwords("english"))
  x <- tm_map(x, stemDocument)
  x <- tm_map(x, stripWhitespace)
  x <- tm_map(x, PlainTextDocument)
  x <- TermDocumentMatrix(x)
  x <- as.matrix(x)
  # cosine similarity (from 'lsa') between the two term vectors
  return(as.numeric(cosine(x[, 1], x[, 2])))
}
apply_similarity <- function(x) {
  # run similarity_score on each row of the data frame
  return(as.data.frame(apply(x, 1, similarity_score)))
}
list_data_frames <- list(df_1, df_2, df_3,...)
output <- as.data.frame(lapply(list_data_frames, apply_similarity))
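For reference, here is a toy version of my input, using the functions defined above (the column names and string contents are made up; my real frames have thousands of rows each):

```r
# each row holds the two strings whose similarity I need
df_1 <- data.frame(a = c("the red dog runs", "cats sleep all day"),
                   b = c("a red dog ran fast", "dogs bark at night"),
                   stringsAsFactors = FALSE)
df_2 <- df_1  # stand-in for a second data frame

list_data_frames <- list(df_1, df_2)
output <- as.data.frame(lapply(list_data_frames, apply_similarity))
```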
It gives me the values I need, but it is extremely slow on a large dataset: running it on 1% of the data took 3 hours on my local machine. I need to do this on around 40 different data frames, so I use lapply on a list of data frames and apply the function to each one.
1) Is there a faster way to do this, perhaps with another package or more efficient code? Am I misusing apply() and lapply()?
2) Can I parallelize this code and run it across multiple processors?
I tried the 'snowfall' package with the sfLapply() and sfApply() functions, but when the cluster is created, snowfall doesn't load packages on the workers and cannot find the functions from the 'tm' package. If I end up running this on Amazon's cloud, is there a way to have R use more than one processor and run functions from packages like 'tm' on multiple cores?
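Here is roughly what my snowfall attempt looked like (a sketch from memory; the cpus count and the commented-out lines are guesses at what I might be missing to get 'tm' and 'lsa' loaded on the workers):

```r
library(snowfall)

sfInit(parallel = TRUE, cpus = 4)        # start a local cluster with 4 workers
sfExport("similarity_score", "apply_similarity")  # push my functions to the workers
# sfLibrary(tm); sfLibrary(lsa)          # do the workers need this to find tm's functions?

result <- sfLapply(list_data_frames, apply_similarity)

sfStop()                                 # shut the cluster down
```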