
I have a large number of RDS files that I need to open, modify, and save in place. One iteration takes ~1.8 s, and there are about 40K files to modify, so I attempted to run the job in parallel. With 28 processors it seems like it should take less than an hour to complete, but instead it is taking 4-5x that long. What can be done to fix this? Each file is read and written by exactly one worker, so there should not be any locking going on. I also tried chunking the work into blocks of 100 files (a rough sketch of that attempt follows the sample code below), but that doesn't help either. I would expect some overhead from the parallel computation, but this seems way out of line to me.

Here is some sample code:

library(parallel)
library(pbapply)

f = function(x) {
    y = readRDS(x)
    # modify something in y
    saveRDS(y, x)
}

files = list.files("C:\\my-dir", full.names = TRUE)

cl = makeCluster(28)
result = pbsapply(files, f, cl = cl)
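
The chunked attempt looked roughly like this (the exact grouping is illustrative), with each parallel task looping over a block of ~100 files instead of handling a single file:

# chunked variant: each task processes a block of ~100 file paths
chunks = split(files, ceiling(seq_along(files) / 100))

f_chunk = function(paths) {
    for (x in paths) {
        y = readRDS(x)
        # modify something in y
        saveRDS(y, x)
    }
}

result = pbsapply(chunks, f_chunk, cl = cl)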
  • Using more cores only speeds up the computation that happens on the cores, not the I/O. If reading and writing take 90% of the time, using more cores will not make it faster and might even slow things down. One option is to distribute the files across separate machines.
  • You might try saveRDS(y, x, compress = FALSE) if speed is more important than file size; that was 25x faster in this older performance comparison from 2019: stackoverflow.com/questions/58699848/… You might also consider transitioning to a different file type / package altogether, like feather or parquet: chainsawriot.com/postmannheim/2023/09/21/benchmark.html (a sketch of the uncompressed-save variant follows this list).
  • You could also try the {qs} package, which does very quick object serialization (also sketched below).
  • The number of cores is typically much larger than the number of I/O channels on a chip. Further, the path from disk to RAM crosses several levels of the memory hierarchy (L1, L2, L3 cache, etc.), so your cores compete for and interfere with each other on these resources. A two-level approach with a few readers and many cores doing the processing is usually very scalable on a cluster using MPI and mclapply's fork (a simplified single-machine sketch is below).
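
To make the serialization suggestions concrete, here is a minimal sketch of two alternative worker functions. The modification step is assumed to be unchanged, and the qs variant assumes the files are stored in the qs format rather than as plain RDS:

library(qs)  # only needed for the qs variant

# Variant 1: keep RDS but skip compression when writing back;
# trades larger files for much faster writes
f_fast_rds = function(x) {
    y = readRDS(x)
    # modify something in y
    saveRDS(y, x, compress = FALSE)
}

# Variant 2: qs serialization; qread()/qsave() use their own format,
# so existing .rds files would need to be converted first
f_qs = function(x) {
    y = qread(x)
    # modify something in y
    qsave(y, x)
}

Either function can be dropped into the same pbsapply(files, ..., cl = cl) call as above.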
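
The two-level idea from the last comment is hard to show in full without MPI, but a simplified single-machine sketch, assuming a POSIX system (mclapply forking is not available on Windows), would cap the number of processes touching the disk while keeping the modification step separate so it could be handed to a larger pool of compute workers if it were expensive:

library(parallel)

modify = function(y) {
    # modify something in y
    y
}

# each reader process owns one block of files: it reads, modifies and
# writes them back, so only a few processes compete for the disk
process_block = function(block) {
    for (x in block) {
        y = readRDS(x)                     # I/O
        y = modify(y)                      # CPU (cheap here)
        saveRDS(y, x, compress = FALSE)    # I/O
    }
    invisible(NULL)
}

n_readers = 4  # far fewer than 28: the disk, not the CPU, is the bottleneck
blocks = split(files, cut(seq_along(files), n_readers, labels = FALSE))
invisible(mclapply(blocks, process_block, mc.cores = n_readers))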
