I'm running an analysis that produces quite a few datasets of 2-3 GB each. Right now I save them as .RData files and load them again later to continue working, which takes a while. My question is: would saving and then loading these files as .csv be faster? Is data.table the fastest package for reading .csv files? I guess I'm looking for the optimal workflow in R.
2 Answers
Based on the comments and some of my own research, I put together a benchmark.
library(bench)
nr_of_rows <- 1e7
set.seed(1)
df <- data.frame(
Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)
baseRDS <- function() {
saveRDS(df, "dataset.Rds")
readRDS("dataset.Rds")
}
baseRDS_nocompress <- function() {
saveRDS(df, "dataset.Rds", compress = FALSE)
readRDS("dataset.Rds")
}
baseRData <- function() {
save(list = "df", file = "dataset.Rdata")
load("dataset.Rdata")
df
}
data.table <- function() {
data.table::fwrite(df, "dataset.csv")
data.table::fread("dataset.csv")
}
feather <- function() {
feather::write_feather(df, "dataset.feather")
as.data.frame(feather::read_feather("dataset.feather"))
}
fst <- function() {
fst::write.fst(df, "dataset.fst")
fst::read.fst("dataset.fst")
}
# only works on Unix systems
# fastSave <- function() {
# fastSave::save.pigz(df, file = "dataset.RData", n.cores = 4)
# fastSave::load.pigz("dataset.RData")
# }
results <- mark(
baseRDS(),
baseRDS_nocompress(),
baseRData(),
data.table(),
feather(),
fst(),
check = FALSE
)
Results
summary(results)
# A tibble: 6 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt>
1 baseRDS() 15.74s 15.74s 0.0635 191MB
2 baseRDS_nocompress() 720.82ms 720.82ms 1.39 191MB
3 baseRData() 18.14s 18.14s 0.0551 191MB
4 data.table() 4.43s 4.43s 0.226 297MB
5 feather() 794.13ms 794.13ms 1.26 191MB
6 fst() 233.96ms 304.28ms 3.29 229MB
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
# n_gc <dbl>, total_time <bch:tm>, result <list>,
# memory <list>, time <list>, gc <list>
summary(results, relative = TRUE)
# A tibble: 6 x 13
expression min median `itr/sec` mem_alloc
<bch:expr> <dbl> <dbl> <dbl> <dbl>
1 baseRDS() 67.3 51.7 1.15 1.00
2 baseRDS_nocompress() 3.08 2.37 25.2 1.00
3 baseRData() 77.5 59.6 1 1.00
4 data.table() 18.9 14.5 4.10 1.56
5 feather() 3.39 2.61 22.8 1
6 fst() 1 1 59.6 1.20
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
# n_gc <dbl>, total_time <bch:tm>, result <list>,
# memory <list>, time <list>, gc <list>
Based on this, the fst package is the fastest. Base R's saveRDS with the option compress = FALSE comes second, although it produces large files. I wouldn't recommend saving anything as CSV unless you want to open it with a different program; in that case data.table would be your choice. Otherwise I would recommend either saveRDS or fst.
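To get a feel for the size trade-off of compress = FALSE, here is a minimal sketch using only base R. The data frame small_df and the temporary file paths are made up for illustration and are much smaller than the benchmark data, so the numbers will not match the timings above.

```r
# Small stand-in data frame (hypothetical; far smaller than the benchmark data)
set.seed(1)
n <- 1e5
small_df <- data.frame(
  Integer = sample(1L:100L, n, replace = TRUE),
  Real    = sample(1:10000, n, replace = TRUE) / 100
)

compressed_path   <- tempfile(fileext = ".Rds")
uncompressed_path <- tempfile(fileext = ".Rds")

saveRDS(small_df, compressed_path)                      # default: gzip compression
saveRDS(small_df, uncompressed_path, compress = FALSE)  # skips compression, larger file

# Compare the on-disk sizes in bytes
sizes <- file.size(c(compressed_path, uncompressed_path))
names(sizes) <- c("compressed", "uncompressed")
print(sizes)

# Either file restores the same object
restored <- readRDS(uncompressed_path)
```

If disk space matters more than load time, the default compression may still be the better choice; it is worth checking on your own data.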
If speed when reading CSV files is what you are after, the aforementioned vroom package is a good option.
.RData may be slow but, unlike CSV, TSV and the like, it has the advantage that it can save any R data type: not just tabular data (usually data frames), but also lists, functions, R6 objects, etc. If you need to save just one data frame, RDS is faster to write (saveRDS) and load (readRDS) than .RData.
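A minimal base R sketch of the difference, assuming three made-up objects (scores, settings, square) and temporary files: save() bundles several named objects of any type, while saveRDS() stores exactly one object and lets you pick the name when reading it back.

```r
# Hypothetical objects of different kinds
scores   <- data.frame(id = 1:3, value = c(0.5, 1.5, 2.5))
settings <- list(alpha = 0.05, labels = c("a", "b"))
square   <- function(x) x^2

rdata_path <- tempfile(fileext = ".RData")
rds_path   <- tempfile(fileext = ".Rds")

# save() stores several named objects in one file;
# load() restores them under their original names and returns those names
save(scores, settings, square, file = rdata_path)
restored_env <- new.env()
loaded_names <- load(rdata_path, envir = restored_env)

# saveRDS() stores a single object; readRDS() returns it,
# so the caller decides what to call it
saveRDS(scores, rds_path)
scores_copy <- readRDS(rds_path)
```

Loading into a separate environment, as done here, avoids silently overwriting objects in the workspace that happen to share a name with something in the file.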
You could also take a look at the new Feather data format developed by Hadley Wickham and Wes McKinney.
Warning for Feather:
What should you not use Feather for?
Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.
(Link is a 2016-03-29 announcement... maybe it is stable now)
data.table::fread is fast for reading CSVs, and in some cases the vroom package is faster: vroom.r-lib.org/articles/benchmarks.html
fst package: fstpackage.org