
I'm running some analysis where I'm generating quite a few datasets that are between 2 and 3 GB. Right now, I'm saving these as .RData files. Later, I load these files to continue working, which takes some time. My question is: would saving and then loading these files as .csv files be faster? Is data.table the fastest package for reading .csv files? I guess I'm looking for the optimum workflow in R.

4 Comments

  • data.table::fread is fast for reading CSVs, and in some cases the vroom package is faster: vroom.r-lib.org/articles/benchmarks.html Commented Nov 4, 2019 at 19:29
  • @JonSpring, is saving to .RData not really recommended? Commented Nov 4, 2019 at 19:34
  • I'm not well versed on the workflow pros/cons of the different formats, but I recall seeing a few comparisons in blogs from the last year. Another option you might look at for fast loading is the fst package: fstpackage.org Commented Nov 4, 2019 at 19:40
  • Relevant SO answer with suggestions on fast read methods: stackoverflow.com/a/1728422/6851825 Commented Nov 4, 2019 at 19:48

2 Answers


Based on the comments and some of my own research, I put together a benchmark.

library(bench)

nr_of_rows <- 1e7
set.seed(1)
df <- data.frame(
  Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
  Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
  Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
  Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
)

baseRDS <- function() {
  saveRDS(df, "dataset.Rds")
  readRDS("dataset.Rds")
}

baseRDS_nocompress <- function() {
  saveRDS(df, "dataset.Rds", compress = FALSE)
  readRDS("dataset.Rds")
}

baseRData <- function() {
  save(list = "df", file = "dataset.Rdata")
  load("dataset.Rdata")
  df
}

data.table <- function() {
  data.table::fwrite(df, "dataset.csv")
  data.table::fread("dataset.csv")
}
  
feather <- function() {
  feather::write_feather(df, "dataset.feather")
  as.data.frame(feather::read_feather("dataset.feather"))
}

fst <- function() {
  fst::write.fst(df, "dataset.fst")
  fst::read.fst("dataset.fst")
}

# only works on Unix systems
# fastSave <- function() {
#   fastSave::save.pigz(df, file = "dataset.RData", n.cores = 4)
#   fastSave::load.pigz("dataset.RData")
# }

results <- mark(
  baseRDS(),
  baseRDS_nocompress(),
  baseRData(),
  data.table(),
  feather(),
  fst(),
  check = FALSE
)

Results

summary(results)
# A tibble: 6 x 13
  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 baseRDS()              15.74s   15.74s    0.0635     191MB
2 baseRDS_nocompress() 720.82ms 720.82ms    1.39       191MB
3 baseRData()            18.14s   18.14s    0.0551     191MB
4 data.table()            4.43s    4.43s    0.226      297MB
5 feather()            794.13ms 794.13ms    1.26       191MB
6 fst()                233.96ms 304.28ms    3.29       229MB
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
#   n_gc <dbl>, total_time <bch:tm>, result <list>,
#   memory <list>, time <list>, gc <list>

summary(results, relative = TRUE)
# A tibble: 6 x 13
  expression             min median `itr/sec` mem_alloc
  <bch:expr>           <dbl>  <dbl>     <dbl>     <dbl>
1 baseRDS()            67.3   51.7       1.15      1.00
2 baseRDS_nocompress()  3.08   2.37     25.2       1.00
3 baseRData()          77.5   59.6       1         1.00
4 data.table()         18.9   14.5       4.10      1.56
5 feather()             3.39   2.61     22.8       1   
6 fst()                 1      1        59.6       1.20
# ... with 8 more variables: `gc/sec` <dbl>, n_itr <int>,
#   n_gc <dbl>, total_time <bch:tm>, result <list>,
#   memory <list>, time <list>, gc <list>

Based on this, the fst package is the fastest. It's followed in second place by base R's saveRDS with the option compress = FALSE, though that produces large files. I wouldn't recommend saving anything as CSV unless you want to open it with a different program; in that case data.table would be your choice. Otherwise I would recommend either saveRDS or fst.
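For intermediate datasets in the 2-3 GB range like the question describes, here is a minimal sketch of an fst-based workflow (the file name is illustrative; the column selection on read is an fst feature that helps when you only need part of a saved dataset):

library(fst)

# Save an intermediate result; compress = 50 is the default
# speed/size trade-off (0 = none, 100 = maximum).
write_fst(df, "dataset.fst", compress = 50)

# Later, resume work -- optionally reading only the columns you need.
df_full   <- read_fst("dataset.fst")
df_subset <- read_fst("dataset.fst", columns = c("Integer", "Real"))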


2 Comments

Nice, but you should add the qs package as well; that is one fast format, I tell you (see the sketch after these comments).
What an amazing answer. I learned a lot from it. Thank you.
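As the first comment suggests, the qs package is another fast serializer for arbitrary R objects. It wasn't part of the benchmark above, so its relative speed here is untested; a minimal sketch:

library(qs)

# qsave serializes any R object, like saveRDS, but with multithreaded
# compression; preset = "fast" favors speed over file size.
qsave(df, "dataset.qs", preset = "fast")
df2 <- qread("dataset.qs")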

If you are looking for speed when reading CSVs, the vroom package mentioned in the comments is a good option.
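A minimal sketch (vroom indexes the file and reads columns lazily, so the initial call returns quickly; the column names and types below assume the benchmark data from the other answer):

library(vroom)

# vroom guesses column types from a sample; pinning them down with
# col_types makes parsing of large files predictable.
df <- vroom("dataset.csv", col_types = cols(
  Logical = col_logical(),
  Integer = col_integer(),
  Real    = col_double(),
  Factor  = col_character()
))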

.RData may be slow but, unlike CSV, TSV and the like, it has the advantage that it can save any R data type: not just tabular data (usually data frames), but also lists, functions, R6 objects, etc. If you need to save just one data frame, RDS is faster to write (saveRDS) and load (readRDS) than .RData.
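A quick illustration of the difference (the model object is just an example):

# .RData restores objects under their original names into the
# current environment:
model <- lm(Real ~ Integer, data = df)
save(df, model, file = "session.RData")
load("session.RData")            # recreates `df` and `model`

# RDS stores a single object and lets you pick the name on read:
saveRDS(model, "model.rds")
fitted_model <- readRDS("model.rds")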

You could also take a look at the Feather data format developed by Hadley Wickham and Wes McKinney.

Warning for Feather:

What should you not use Feather for?

Feather is not designed for long-term data storage. At this time, we do not guarantee that the file format will be stable between versions. Instead, use Feather for quickly exchanging data between Python and R code, or for short-term storage of data frames as part of some analysis.

(Link is a 2016-03-29 announcement... maybe it is stable now)
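For what it's worth, Feather has since been folded into Apache Arrow (as Feather V2), which is a stable on-disk format; a minimal sketch using the arrow package instead of the original feather package:

library(arrow)

# Feather V2 is the Arrow IPC file format; unlike the 2016 Feather V1,
# it is intended for long-term storage.
write_feather(df, "dataset.feather")
df2 <- read_feather("dataset.feather")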
