
I'm trying to read in a lot of data as efficiently as possible; the data are in about 1,400 CSV files within 6 ZIP files. The CSV files are all similar timeseries data with the same columns, except some CSV files have an extra column ("yield_c" in this example). I need to import all of the data and would prefer not to unzip the ZIP files. I've had success with two approaches, but both have problems; we'll call them A and B. Approach A runs much faster than B, but it adds each CSV header as a new data row and (likely as a result) reads every column as character. Approach B processes the full dataset correctly but is abominably slow.

Any tips on coding this better for efficiency and a proper data pull? I think I'm missing something simple about how fread processes the cmd input, but I can't figure out how to make approach A work correctly without it adding the header rows every time.
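To see where the stray rows come from (a minimal sketch, assuming the sample zips built below already exist under temp/): unzip -p streams every member CSV back to back, so each file's header line after the first arrives as an ordinary data row, and fread then types every column as character.

library(data.table)
# "unzip -p" concatenates all member CSVs into one stream; the header lines of
# the 2nd..nth files become ordinary rows, so every column is read as character.
stream <- fread(cmd = "unzip -p temp/zip1.zip", header = TRUE, fill = TRUE)
stream[date == "date"]  # the repeated header rows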

Here is a sample dataset of "only" 240 CSV files among 6 ZIP files with my two data processor functions datprocessorA() and datprocessorB(). Currently (for me) approach A takes a few seconds and approach B takes 2.5 minutes.

Apologies for any formatting or etiquette issues, I don't post here as I can usually find a solution. Similarly, apologies if this is a dupe but I've scoured the forums and haven't found anything to help this specific case; plenty of related issues but not this one in particular.

library(data.table)
library(magrittr)  # for the %>% pipe used in datprocessorB

# Create sample CSV files and zip them in a folder

tempath <- "temp"
dir.create(file.path(tempath), showWarnings = FALSE)

setwd(tempath)

dfs <- list()
num_dfs <- 240

for (i in 1:num_dfs) {
  
  data <- data.frame(date = seq(as.Date("1980-10-01"), as.Date("2020-09-01"), by = "month"),
                     Subset = paste0("Subset", i),
                     yield_a = sample(0:20,480, replace = TRUE),
                     yield_b = sample(0:5,480, replace = TRUE))
  
  df_name <- paste0("df", i)
  
  dfs[[df_name]] <- data
}

# randomly (ish) add "yield_c" column to a subset of dataframes
seqz <- sort(unique(c(1,sample(seq(1,num_dfs, by = 5), size = num_dfs%/%3, replace = TRUE))))

for (j in seqz) {
  dfs[[j]]$yield_c = sample(1:10,480, replace = TRUE)
}

# write all to csv, zip up, delete csvs
for (k in seq_along(dfs)) {
  file_name <- paste0("file", k, ".csv") # Create a unique filename
  write.csv(dfs[[k]], file = file_name, row.names = FALSE) # Save to CSV
}

zipcount <- 6
for (l in 1:zipcount) {
  zip(paste0("zip",l,".zip"), paste0("file", seq(num_dfs/zipcount*(l-1)+1,num_dfs/zipcount*l,1), ".csv"))
}

file.remove(list.files(pattern = "\\.csv$", full.names = TRUE))
# import

# return to parent directory to more closely imitate my scenario
setwd("..")

#get path for all zip files
zips <- paste0(tempath, "/", list.files(path = tempath, full.names = FALSE, recursive = FALSE))

# first function to import all csvs into R as a table

datprocessorA <- function(zipdir) {
  
  connz <- paste("unzip -p",zipdir)
  datz <- rbindlist(lapply(connz, function(x) fread(x, sep=',', header = TRUE,
                                                   stringsAsFactors=FALSE, fill = Inf)),
                    use.names = TRUE, fill=TRUE)

  datz
}

# alternative function to import all csvs into R as a table
datprocessorB <- function(zipdir) {
  
  datz <- lapply(setNames(nm = unzip(zipdir, list = TRUE)$Name),
                 function(fn) fread(cmd = paste("unzip -p -a",zipdir, fn), header = TRUE,
                                    stringsAsFactors=FALSE, fill = Inf)) %>%
    rbindlist(use.names = TRUE, fill=TRUE)
  datz
}

# for loop to process import function; choose datprocessorA or datprocessorB

funx <- datprocessorA

begin <- Sys.time()
for (i in seq_along(zips)) {
  
  zipdir <- zips[i]
  
  if (i == 1) {
    import  <- funx(zipdir)
  } else {
    dat2 <- funx(zipdir)
    import  <- rbind(import, dat2, fill = TRUE)
    rm(dat2)
  }
}
end <- Sys.time()

print(paste0("Approach A length: ",round(end-begin), " seconds"))

funx <- datprocessorB

begin <- Sys.time()
for (i in seq_along(zips)) {
  
  zipdir <- zips[i]
  
  if (i == 1) {
    import  <- funx(zipdir)
  } else {
    dat2 <- funx(zipdir)
    import  <- rbind(import, dat2, fill = TRUE)
    rm(dat2)
  }
}
end <- Sys.time()

print(paste0("Approach B length: ",round(end-begin), " minutes"))

3 Answers


The following function seems to do what you want and is fast.
It reads all the files in the zips passed in as the vector zip_vector, with no need for a for loop.

datprocessorC <- function(zip_vector) {
  f <- function(zipfile) {
    fls <- unzip(zipfile, list = TRUE)$Name
    lapply(fls, \(f) fread(unzip(zipfile, files = f))) |> rbindlist(fill = TRUE)
  }
  lapply(zip_vector, f) |> rbindlist(fill = TRUE)
}

# run on the zip files vector returned by list.files()
importC <- datprocessorC(zips)
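A variant sketch that keeps the working directory clean: utils::unzip() accepts an exdir argument, so extraction can be pointed at tempdir() and the extracted CSVs deleted on exit (one way to handle the working-directory point raised in the comments below):

datprocessorC_tmp <- function(zip_vector) {
  td <- tempdir()
  on.exit(unlink(file.path(td, "*.csv")))  # remove the extracted CSVs afterwards
  f <- function(zipfile) {
    fls <- unzip(zipfile, list = TRUE)$Name
    lapply(fls, \(fn) fread(unzip(zipfile, files = fn, exdir = td))) |>
      rbindlist(fill = TRUE)
  }
  lapply(zip_vector, f) |> rbindlist(fill = TRUE)
}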

Edit

Here is another function that reads the files with unz()/readr::read_csv() without unzipping them to disk; no file1.csv, file2.csv, etc. are created.

This function takes about twice the time of datprocessorC to extract the data.

datprocessorD <- function(zip_vector) {
  f <- function(zipfile) {
    fls <- unzip(zipfile, list = TRUE)$Name
    out <- vector("list", length(fls))
    for(i in seq_along(fls)) {
      tmp <- unz(zipfile, filename = fls[i])
      out[[i]] <- readr::read_csv(file = tmp, show_col_types = FALSE)
    }
    data.table::rbindlist(out, fill = TRUE)
  }
  lapply(zip_vector, f) |> data.table::rbindlist(fill = TRUE)
}
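Usage would mirror datprocessorC, e.g.:

importD <- datprocessorD(zips)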

5 Comments

30 seconds on my real dataset, columns are set to their proper classes, and no repeating header rows. This is great except it does physically unzip all of the files into the working directory. I can either: edit to unzip to a temp directory, write a line of code to delete after, or get over it. Thanks!
@Riley Glad it helped. I have edited the answer with another function that doesn't unzip the files but is slower. See if it can help to solve your use case.
"with no need for a for loop."--there are hidden loops (lapply()s), where one could be avoided.
@Friede That's true, thanks. Anyway, the OP doesn't want to unzip the files and unz only extracts one file within the .zip at a time, so the inner loop is needed.
Thanks Rui, I'm finding it easier and faster to just unzip to a temp directory and delete after. This was very helpful, much appreciated!

Sorry, no fread(), but DuckDB combined with the zipfs community extension is also reasonably fast, even though the zipfs devs warn that it's more about convenience than high performance (see "Performance considerations" in its docs). And globbing for both the zip files and the archive contents is indeed nice.

DuckDB through duckplyr, timing stops after full materialization:

library(duckplyr)
db_exec("INSTALL zipfs FROM community")
db_exec("LOAD zipfs")

tictoc::tic()
import_dd <-
  read_csv_duckdb(
    path = "zip://temp/*.zip/*.csv", 
    options = list(union_by_name = TRUE)
  ) |> 
  data.table::setDT()
tictoc::toc()
#> 2.71 sec elapsed

import_dd
#>               date    Subset yield_a yield_b yield_c
#>             <Date>    <char>   <num>   <num>   <num>
#>      1: 1980-10-01   Subset1       4       2       3
#>      2: 1980-11-01   Subset1       8       1       3
#>      3: 1980-12-01   Subset1       3       1       9
#>      4: 1981-01-01   Subset1       1       1       2
#>      5: 1981-02-01   Subset1      17       0       9
#>     ---                                             
#> 115196: 2020-05-01 Subset240      18       0      NA
#> 115197: 2020-06-01 Subset240       3       3      NA
#> 115198: 2020-07-01 Subset240       5       3      NA
#> 115199: 2020-08-01 Subset240      15       0      NA
#> 115200: 2020-09-01 Subset240       6       4      NA

This was with default DuckDB settings; it sets the thread count to the number of CPU cores, 8 in my case (Core i7).
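If needed, the thread count can be adjusted the same way the extension was loaded, using DuckDB's standard SET statement (a sketch; 4 is an arbitrary example value):

db_exec("SET threads = 4")  # cap DuckDB at 4 threads instead of all cores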



Using the external tar command (which also unzips and is included with Windows 11 and most Linux distributions), iterate over the zip files, running tar and excluding the header rows, then read the output with fread. This is for Windows 11; on Linux, use the commented-out lines instead of the 2 lines preceding them.

library(data.table)

p <- proc.time()

col.names <- c("date", "Subset", "yield_a", "yield_b", "yield_c")

tar <- Sys.which("tar")[[1]]
fmt <- paste(tar, "-xOf %s | findstr /b /v .date")
# unzip <- Sys.which("unzip")[[1]]
# fmt <- paste(unzip, "-p %s | grep -v date")  # uncomment on Linux

res <- Sys.glob("*.zip") |>
  lapply(\(x) fread(cmd = sprintf(fmt,x), col.names = col.names, fill = TRUE)) |>
  rbindlist()

proc.time() - p
##    user  system elapsed 
##    0.02    0.14    0.85 

Old

Using the external 7z program, this takes 7 seconds on a Windows 11 system (i7 processor). It does not create any intermediate files. It first reads and concatenates all the files in each zip file, then splits out the header lines, keeping only the longest, which it puts up front, and then reads that text as a CSV.

library(magrittr)

p <- proc.time()

dat <- zips %>%
  lapply(function(file) shell(paste("7z x -so", file), intern = TRUE)) %>%
  unlist %>%
  split(!grepl("date", .)) %>%
  within(`FALSE` <- `FALSE`[which.max(nchar(`FALSE`))]) %>%
  unlist %>%
  read.csv(text = .)

proc.time() - p
##    user  system elapsed 
##    0.87    4.80    7.02 

Using gawk we can process this in under a second on my PC. This is for Windows, but it would not be hard to modify it for Linux.

dir/b *.zip | gawk -e "{ system(\"7z.exe x -so \" $0 ) }" > all.csv  
findstr /b .date all.csv | sort /r | gawk "NR==1" > final.csv
findstr /b "[0-9]" all.csv >> final.csv

1 Comment

Revised to use tar.
