I need some help making the calculation of cosine similarity scores for vectors in a term-document matrix much faster. I have a matrix of strings and I need to get the word similarity scores between the strings in each row of the matrix.

I am using the 'tm' package to create a term-document matrix for each row of a data frame of text strings, and the 'lsa' package to get the cosine similarity score between the two vectors of words in the strings. I'm also using apply() to run the function below on an entire data frame:

library(tm)   # corpus pre-processing and term-document matrices
library(lsa)  # cosine() for the similarity score

similarity_score <- function(x) {
  x <- VectorSource(x)
  x <- Corpus(x)
  x <- tm_map(x, tolower)
  x <- tm_map(x, removePunctuation)
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, removeWords, stopwords("english"))
  x <- tm_map(x, stemDocument)
  x <- tm_map(x, stripWhitespace)
  x <- tm_map(x, PlainTextDocument)
  x <- TermDocumentMatrix(x)
  x <- as.matrix(x)
  return(as.numeric(cosine(x[, 1], x[, 2])))
}

apply_similarity <- function(x) {
  return(as.data.frame(apply(x , 1, similarity_score)))
}

list_data_frames <- list(df_1, df_2, df_3,...)

output <- as.data.frame(lapply(list_data_frames, apply_similarity))

It gives me the values I need, but it is extremely slow on the full dataset: running it on 1% of the data took 3 hours on my local machine. I need to do this on around 40 different data frames, so I am using lapply() on a list of data frames and applying the function to each one.

1) Is there a better way to do this that is faster? Maybe with another package or more efficient code? Am I using apply and lapply wrong in my code?

2) Can I parallelize this code and run it on multiple processors?

I tried using the snowfall package and the sfLapply and sfApply functions, but when the cluster is created with snowfall it doesn't load packages, and the workers cannot find the functions from the 'tm' package. If I end up doing this on Amazon's cloud, is there a way to have R use more than one processor and run functions from packages like 'tm' on multiple cores? A sketch of what I was attempting is below.
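For reference, this is roughly the snowfall setup I was trying (a minimal sketch; I understand sfLibrary() and sfExport() are supposed to push the packages and my functions to the workers, but I haven't gotten this working):

library(snowfall)

sfInit(parallel = TRUE, cpus = 4)          # start a local cluster with 4 workers
sfLibrary(tm)                              # load the packages on every worker
sfLibrary(lsa)
sfExport("similarity_score", "apply_similarity")   # make my functions visible to the workers

output <- sfLapply(list_data_frames, apply_similarity)

sfStop()                                   # shut the cluster down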

• Do you need to do the pre-processing online (in the function)? Commented Aug 17, 2017 at 6:07

1 Answer


Have you tried using the parallel package? There is a good guide on gforge here. Essentially, you just have to start up a cluster, load the libraries onto the nodes with clusterEvalQ, and you should be good to go. I'm trying it out now. Then again, this answer comes more than a year later, and you've probably found a good solution by now. Do share if you've come across something.
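Something along these lines should work (a rough sketch reusing your function and object names; adjust the number of cores for your machine):

library(parallel)

cl <- makeCluster(detectCores() - 1)                 # one worker per core, leaving one free

# load the required packages on every node of the cluster
clusterEvalQ(cl, {
  library(tm)
  library(lsa)
})

# export your functions so the workers can see them
clusterExport(cl, c("similarity_score", "apply_similarity"))

# parallel version of lapply over the list of data frames
output <- parLapply(cl, list_data_frames, apply_similarity)

stopCluster(cl)

parLapply returns a list, so you can still wrap it in as.data.frame the way you did with lapply.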
