
I have a large number of RDS files that I need to open, modify, and save in place. One iteration takes ~1.8 s, and there are about 40K files to modify, so I attempted to run the job in parallel. With 28 processors it seems like it should take less than an hour to complete, but instead it is taking 4-5x that long. What can be done to fix this? Each file is read and written by exactly one worker, so there should not be any locking going on. I also tried chunking the work into blocks of 100 files (a rough sketch of that attempt follows the sample code below), but that doesn't help either. I would expect some overhead from the parallel computation, but this seems way out of line to me.

Here is some sample code:

library(parallel)
library(pbapply)

f = function(x) {
    y = readRDS(x)
    # modify something in y
    saveRDS(y, x)
}

files = list.files("C:\\my-dir", full.names = TRUE)

cl = makeCluster(28)
result = pbsapply(files, f, cl = cl)
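
The chunked attempt looked roughly like this (the exact grouping is illustrative), with each parallel task looping over a block of ~100 files instead of handling a single file:

# chunked variant: each task processes a block of ~100 file paths
chunks = split(files, ceiling(seq_along(files) / 100))

f_chunk = function(paths) {
    for (x in paths) {
        y = readRDS(x)
        # modify something in y
        saveRDS(y, x)
    }
}

result = pbsapply(chunks, f_chunk, cl = cl)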
  • Using more cores only speeds up the computation that happens on the cores, not the I/O. If reading and writing take 90% of the time, using more cores will not make it faster and might even slow things down. One option is to distribute the files across separate machines.
  • You might try saveRDS(y, x, compress = FALSE) if speed is more important than file size; that was 25x faster in this older performance comparison from 2019: stackoverflow.com/questions/58699848/… You might also consider transitioning to a different file type / package altogether, like feather or parquet: chainsawriot.com/postmannheim/2023/09/21/benchmark.html (a sketch of the uncompressed-save variant follows this list).
  • You could also try the {qs} package, which does very quick object serialization (also sketched below).
  • The number of cores is typically much larger than the number of I/O channels on a chip. Further, the path from disk to RAM crosses several levels of the memory hierarchy (L1, L2, L3 cache, etc.), so your cores compete for and interfere with each other on these resources. A two-level approach with a few readers and many cores doing the processing is usually very scalable on a cluster using MPI and mclapply's fork (a simplified single-machine sketch is below).
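
To make the serialization suggestions concrete, here is a minimal sketch of two alternative worker functions. The modification step is assumed to be unchanged, and the qs variant assumes the files are stored in the qs format rather than as plain RDS:

library(qs)  # only needed for the qs variant

# Variant 1: keep RDS but skip compression when writing back;
# trades larger files for much faster writes
f_fast_rds = function(x) {
    y = readRDS(x)
    # modify something in y
    saveRDS(y, x, compress = FALSE)
}

# Variant 2: qs serialization; qread()/qsave() use their own format,
# so existing .rds files would need to be converted first
f_qs = function(x) {
    y = qread(x)
    # modify something in y
    qsave(y, x)
}

Either function can be dropped into the same pbsapply(files, ..., cl = cl) call as above.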
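
The two-level idea from the last comment is hard to show in full without MPI, but a simplified single-machine sketch, assuming a POSIX system (mclapply forking is not available on Windows), would cap the number of processes touching the disk while keeping the modification step separate so it could be handed to a larger pool of compute workers if it were expensive:

library(parallel)

modify = function(y) {
    # modify something in y
    y
}

# each reader process owns one block of files: it reads, modifies and
# writes them back, so only a few processes compete for the disk
process_block = function(block) {
    for (x in block) {
        y = readRDS(x)                     # I/O
        y = modify(y)                      # CPU (cheap here)
        saveRDS(y, x, compress = FALSE)    # I/O
    }
    invisible(NULL)
}

n_readers = 4  # far fewer than 28: the disk, not the CPU, is the bottleneck
blocks = split(files, cut(seq_along(files), n_readers, labels = FALSE))
invisible(mclapply(blocks, process_block, mc.cores = n_readers))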
