I need help speeding up the calculation of cosine similarity scores between vectors in a term-document matrix. I have a matrix of strings and need to get the word similarity scores between the strings in each row.
I am using the 'tm' package to create a term-document matrix for each row of a data frame of text strings, and the 'lsa' package to get the cosine similarity score between the two resulting word vectors. I'm also using apply() to run the function below over an entire data frame:
similarity_score <- function(x) {
  # build a corpus from the two strings in this row
  x <- VectorSource(x)
  x <- Corpus(x)
  # standard tm preprocessing
  x <- tm_map(x, tolower)
  x <- tm_map(x, removePunctuation)
  x <- tm_map(x, removeNumbers)
  x <- tm_map(x, removeWords, stopwords("english"))
  x <- tm_map(x, stemDocument)
  x <- tm_map(x, stripWhitespace)
  x <- tm_map(x, PlainTextDocument)
  x <- TermDocumentMatrix(x)
  x <- as.matrix(x)
  # cosine similarity (from 'lsa') between the two term vectors
  return(as.numeric(cosine(x[, 1], x[, 2])))
}
apply_similarity <- function(x) {
  # run similarity_score on each row of the data frame
  return(as.data.frame(apply(x, 1, similarity_score)))
}
list_data_frames <- list(df_1, df_2, df_3,...)
output <- as.data.frame(lapply(list_data_frames, apply_similarity))
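For reference, here is a toy version of my input, using the functions defined above (the column names and string contents are made up; my real frames have thousands of rows each):

```r
# each row holds the two strings whose similarity I need
df_1 <- data.frame(a = c("the red dog runs", "cats sleep all day"),
                   b = c("a red dog ran fast", "dogs bark at night"),
                   stringsAsFactors = FALSE)
df_2 <- df_1  # stand-in for a second data frame

list_data_frames <- list(df_1, df_2)
output <- as.data.frame(lapply(list_data_frames, apply_similarity))
```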
It gives me the values I need, but it is extremely slow on a large dataset: running it on 1% of the data took 3 hours on my local machine. I need to do this on around 40 different data frames, so I use lapply on a list of data frames and apply the function to each one.
1) Is there a faster way to do this, perhaps with another package or more efficient code? Am I misusing apply() and lapply()?
2) Can I parallelize this code and run it across multiple processors?
I tried the 'snowfall' package with the sfLapply() and sfApply() functions, but when the cluster is created, snowfall doesn't load packages on the workers and cannot find the functions from the 'tm' package. If I end up running this on Amazon's cloud, is there a way to have R use more than one processor and run functions from packages like 'tm' on multiple cores?
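Here is roughly what my snowfall attempt looked like (a sketch from memory; the cpus count and the commented-out lines are guesses at what I might be missing to get 'tm' and 'lsa' loaded on the workers):

```r
library(snowfall)

sfInit(parallel = TRUE, cpus = 4)        # start a local cluster with 4 workers
sfExport("similarity_score", "apply_similarity")  # push my functions to the workers
# sfLibrary(tm); sfLibrary(lsa)          # do the workers need this to find tm's functions?

result <- sfLapply(list_data_frames, apply_similarity)

sfStop()                                 # shut the cluster down
```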