
I'm running some analysis where I'm generating quite a few datasets that are between 2 and 3 GB. Right now, I'm saving these as .RData files. Later, I load these files to continue working, which takes some time. My question is: would saving and then loading these files as .csv files be faster? Is data.table the fastest package for reading .csv files? I guess I'm looking for the optimum workflow in R.

4 Comments

  • data.table::fread is fast for reading CSVs, and in some cases the vroom package is faster: vroom.r-lib.org/articles/benchmarks.html Commented Nov 4, 2019 at 19:29
  • @JonSpring, is saving to .RData not really recommended? Commented Nov 4, 2019 at 19:34
  • I'm not well versed on the workflow pros/cons of the different formats, but I recall seeing a few comparisons in blogs from the last year. Another option you might look at for fast loading is the fst package: fstpackage.org Commented Nov 4, 2019 at 19:40
  • Relevant SO answer with suggestions on fast read methods: stackoverflow.com/a/1728422/6851825 Commented Nov 4, 2019 at 19:48

2 Answers


Based on the comments and some of my own research, I put together a benchmark.

library(bench)

nr_of_rows <- 1e7
set.seed(1)
df <- data.frame(
  Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)

baseRDS <- function() {
  saveRDS(df, "dataset.Rds")
  readRDS("dataset.Rds")
}

baseRDS_nocompress <- function() {
  saveRDS(df, "dataset.Rds", compress = FALSE)
  readRDS("dataset.Rds")
}

baseRData <- function() {
  save(list = "df", file = "dataset.Rdata")
  load("dataset.Rdata")
  df
}

data.table <- function() {
  data.table::fwrite(df, "dataset.csv")
  data.table::fread("dataset.csv")
}
  
feather <- function() {
  feather::write_feather(df, "dataset.feather")
  as.data.frame(feather::read_feather("dataset.feather"))
}

fst <- function() {
  fst::write.fst(df, "dataset.fst")
  fst::read.fst("dataset.fst")
}

# only works on Unix systems
# fastSave <- function() {
#   fastSave::save.pigz(df, file = "dataset.RData", n.cores = 4)
#   fastSave::load.pigz("dataset.RData")
# }

results <- mark(
  baseRDS(),
  baseRDS_nocompress(),
  baseRData(),
  data.table(),
  feather(),
  fst(),
  check = FALSE
)

Results

summary(results)
# A tibble: 6 x 13
  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 baseRDS()              15.74s   15.74s    0.0635     191MB
2 baseRDS_nocompress() 720.82ms 720.82ms    1.39       191MB
3 baseRData()            18.14s   18.14s    0.0551     191MB
4 data.table()            4.43s    4.43s    0.226      297MB
5 feather()            794.13ms 794.13ms    1.26       191MB
6 fst()                233.96ms 304.28ms    3.29       229MB
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
#   n_gc <dbl>, total_time <bch:tm>, result <list>,
#   memory <list>, time <list>, gc <list>

summary(results, relative = TRUE)
# A tibble: 6 x 13
  expression             min median `itr/sec` mem_alloc
  <bch:expr>           <dbl>  <dbl>     <dbl>     <dbl>
1 baseRDS()            67.3   51.7       1.15      1.00
2 baseRDS_nocompress()  3.08   2.37     25.2       1.00
3 baseRData()          77.5   59.6       1         1.00
4 data.table()         18.9   14.5       4.10      1.56
5 feather()             3.39   2.61     22.8       1   
6 fst()                 1      1        59.6       1.20
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
#   n_gc <dbl>, total_time <bch:tm>, result <list>,
#   memory <list>, time <list>, gc <list>

Based on this, the fst package is the fastest. It's followed in second place by base R's saveRDS with the option compress = FALSE, though that produces large files. I wouldn't recommend saving anything as CSV unless you want to open it with a different program; in that case data.table would be your choice. Otherwise I would recommend either saveRDS or fst.
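For intermediate datasets in the 2-3 GB range like the question describes, here is a minimal sketch of an fst-based workflow (the file name is illustrative; the column selection on read is an fst feature that helps when you only need part of a saved dataset):

library(fst)

# Save an intermediate result; compress = 50 is the default
# speed/size trade-off (0 = none, 100 = maximum).
write_fst(df, "dataset.fst", compress = 50)

# Later, resume work -- optionally reading only the columns you need.
df_full   <- read_fst("dataset.fst")
df_subset <- read_fst("dataset.fst", columns = c("Integer", "Real"))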


2 Comments

Nice, but you should add the qs package as well; that is one fast format, I tell you (see the sketch after these comments).
What an amazing answer. I learned a lot from it. Thank you.
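As the first comment suggests, the qs package is another fast serializer for arbitrary R objects. It wasn't part of the benchmark above, so its relative speed here is untested; a minimal sketch:

library(qs)

# qsave serializes any R object, like saveRDS, but with multithreaded
# compression; preset = "fast" favors speed over file size.
qsave(df, "dataset.qs", preset = "fast")
df2 <- qread("dataset.qs")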

If you are looking for speed when reading CSVs, the vroom package mentioned in the comments is a good option.
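A minimal sketch (vroom indexes the file and reads columns lazily, so the initial call returns quickly; the column names and types below assume the benchmark data from the other answer):

library(vroom)

# vroom guesses column types from a sample; pinning them down with
# col_types makes parsing of large files predictable.
df <- vroom("dataset.csv", col_types = cols(
  Logical = col_logical(),
  Integer = col_integer(),
  Real    = col_double(),
  Factor  = col_character()
))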

.RData may be slow but, unlike CSV, TSV and the like, it has the advantage that it can save any R data type: not just tabular data (usually data frames), but also lists, functions, R6 objects, etc. If you need to save just one data frame, RDS is faster to write (saveRDS) and load (readRDS) than .RData.
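A quick illustration of the difference (the model object is just an example):

# .RData restores objects under their original names into the
# current environment:
model <- lm(Real ~ Integer, data = df)
save(df, model, file = "session.RData")
load("session.RData")            # recreates `df` and `model`

# RDS stores a single object and lets you pick the name on read:
saveRDS(model, "model.rds")
fitted_model <- readRDS("model.rds")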

You could also take a look at the Feather data format developed by Hadley Wickham and Wes McKinney.

Warning for Feather:

What should you not use Feather for?

Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.

(Link is a 2016-03-29 announcement... maybe it is stable now)
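For what it's worth, Feather has since been folded into Apache Arrow (as Feather V2), which is a stable on-disk format; a minimal sketch using the arrow package instead of the original feather package:

library(arrow)

# Feather V2 is the Arrow IPC file format; unlike the 2016 Feather V1,
# it is intended for long-term storage.
write_feather(df, "dataset.feather")
df2 <- read_feather("dataset.feather")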
