
I'm trying to read in a lot of data as efficiently as possible; the data are in about 1,400 CSV files within 6 ZIP files. The CSV files are all similar timeseries data with the same columns, except some CSV files have an extra column ("yield_c" in this example). I need to import all of the data and would prefer not to unzip the ZIP files. I've had success with two approaches, but both have problems; we'll call them A and B. Approach A runs much faster than B, but it adds each CSV header as a new data row and (likely as a result) reads every column as character. Approach B processes the full dataset correctly but is abominably slow.

Any tips on coding this better for efficiency and a proper data pull? I think I'm missing something simple about how fread processes the cmd input, but I can't figure out how to make approach A work correctly without it adding the header rows every time.
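To see where the stray rows come from (a minimal sketch, assuming the sample zips built below already exist under temp/): unzip -p streams every member CSV back to back, so each file's header line after the first arrives as an ordinary data row, and fread then types every column as character.

library(data.table)
# "unzip -p" concatenates all member CSVs into one stream; the header lines of
# the 2nd..nth files become ordinary rows, so every column is read as character.
stream <- fread(cmd = "unzip -p temp/zip1.zip", header = TRUE, fill = TRUE)
stream[date == "date"]  # the repeated header rows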

Here is a sample dataset of "only" 240 CSV files among 6 ZIP files with my two data processor functions datprocessorA() and datprocessorB(). Currently (for me) approach A takes a few seconds and approach B takes 2.5 minutes.

Apologies for any formatting or etiquette issues, I don't post here as I can usually find a solution. Similarly, apologies if this is a dupe but I've scoured the forums and haven't found anything to help this specific case; plenty of related issues but not this one in particular.

library(data.table)
library(magrittr)  # for the %>% pipe used in datprocessorB

# Create sample CSV files and zip them in a folder

tempath <- "temp"
dir.create(file.path(tempath), showWarnings = FALSE)

setwd(tempath)

dfs <- list()
num_dfs <- 240

for (i in 1:num_dfs) {
  
  data <- data.frame(date = seq(as.Date("1980-10-01"), as.Date("2020-09-01"), by = "month"),
                     Subset = paste0("Subset", i),
                     yield_a = sample(0:20,480, replace = TRUE),
                     yield_b = sample(0:5,480, replace = TRUE))
  
  df_name <- paste0("df", i)
  
  dfs[[df_name]] <- data
}

# randomly (ish) add "yield_c" column to a subset of dataframes
seqz <- sort(unique(c(1,sample(seq(1,num_dfs, by = 5), size = num_dfs%/%3, replace = TRUE))))

for (j in seqz) {
  dfs[[j]]$yield_c = sample(1:10,480, replace = TRUE)
}

# write all to csv, zip up, delete csvs
for (k in seq_along(dfs)) {
  file_name <- paste0("file", k, ".csv") # Create a unique filename
  write.csv(dfs[[k]], file = file_name, row.names = FALSE) # Save to CSV
}

zipcount <- 6
for (l in 1:zipcount) {
  zip(paste0("zip",l,".zip"), paste0("file", seq(num_dfs/zipcount*(l-1)+1,num_dfs/zipcount*l,1), ".csv"))
}

file.remove(list.files(pattern = "\\.csv$", full.names = TRUE))
# import

# return to parent directory to more closely imitate my scenario
setwd("..")

#get path for all zip files
zips <- paste0(tempath, "/", list.files(path = tempath, full.names = FALSE, recursive = FALSE))

# first function to import all csvs into R as a table

datprocessorA <- function(zipdir) {
  
  connz <- paste("unzip -p",zipdir)
  datz <- rbindlist(lapply(connz, function(x) fread(x, sep=',', header = TRUE,
                                                   stringsAsFactors=FALSE, fill = Inf)),
                    use.names = TRUE, fill=TRUE)

  datz
}

# alternative function to import all csvs into R as a table
datprocessorB <- function(zipdir) {
  
  datz <- lapply(setNames(nm = unzip(zipdir, list = TRUE)$Name),
                 function(fn) fread(cmd = paste("unzip -p -a",zipdir, fn), header = TRUE,
                                    stringsAsFactors=FALSE, fill = Inf)) %>%
    rbindlist(use.names = TRUE, fill=TRUE)
  datz
}

# for loop to process import function; choose datprocessorA or datprocessorB

funx <- datprocessorA

begin <- Sys.time()
for (i in seq_along(zips)) {
  
  zipdir <- zips[i]
  
  if (i == 1) {
    import  <- funx(zipdir)
  } else {
    dat2 <- funx(zipdir)
    import  <- rbind(import, dat2, fill = TRUE)
    rm(dat2)
  }
}
end <- Sys.time()

print(paste0("Approach A length: ",round(end-begin), " seconds"))

funx <- datprocessorB

begin <- Sys.time()
for (i in seq_along(zips)) {
  
  zipdir <- zips[i]
  
  if (i == 1) {
    import  <- funx(zipdir)
  } else {
    dat2 <- funx(zipdir)
    import  <- rbind(import, dat2, fill = TRUE)
    rm(dat2)
  }
}
end <- Sys.time()

print(paste0("Approach B length: ",round(end-begin), " minutes"))

3 Answers


The following function seems to do what you want and is fast.
It reads all the files in the zips passed in as the vector zip_vector, with no need for a for loop.

datprocessorC <- function(zip_vector) {
  f <- function(zipfile) {
    fls <- unzip(zipfile, list = TRUE)$Name
    lapply(fls, \(f) fread(unzip(zipfile, files = f))) |> rbindlist(fill = TRUE)
  }
  lapply(zip_vector, f) |> rbindlist(fill = TRUE)
}

# run on the zip files vector returned by list.files()
importC <- datprocessorC(zips)
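A variant sketch that keeps the working directory clean: utils::unzip() accepts an exdir argument, so extraction can be pointed at tempdir() and the extracted CSVs deleted on exit (one way to handle the working-directory point raised in the comments below):

datprocessorC_tmp <- function(zip_vector) {
  td <- tempdir()
  on.exit(unlink(file.path(td, "*.csv")))  # remove the extracted CSVs afterwards
  f <- function(zipfile) {
    fls <- unzip(zipfile, list = TRUE)$Name
    lapply(fls, \(fn) fread(unzip(zipfile, files = fn, exdir = td))) |>
      rbindlist(fill = TRUE)
  }
  lapply(zip_vector, f) |> rbindlist(fill = TRUE)
}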

Edit

Here is another function that reads the files with unz()/readr::read_csv() without unzipping them to disk; no file1.csv, file2.csv, etc. are created.

This function takes about twice the time of datprocessorC to extract the data.

datprocessorD <- function(zip_vector) {
  f <- function(zipfile) {
    fls <- unzip(zipfile, list = TRUE)$Name
    out <- vector("list", length(fls))
    for(i in seq_along(fls)) {
      tmp <- unz(zipfile, filename = fls[i])
      out[[i]] <- readr::read_csv(file = tmp, show_col_types = FALSE)
    }
    data.table::rbindlist(out, fill = TRUE)
  }
  lapply(zip_vector, f) |> data.table::rbindlist(fill = TRUE)
}
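Usage would mirror datprocessorC, e.g.:

importD <- datprocessorD(zips)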

5 Comments

30 seconds on my real dataset, columns are set to their proper classes, and no repeating header rows. This is great except it does physically unzip all of the files into the working directory. I can either: edit to unzip to a temp directory, write a line of code to delete after, or get over it. Thanks!
@Riley Glad it helped. I have edited the answer with another function that doesn't unzip the files but is slower. See if it can help to solve your use case.
"with no need for a for loop."--there are hidden loops (lapply()s), where one could be avoided.
@Friede That's true, thanks. Anyway, the OP doesn't want to unzip the files and unz only extracts one file within the .zip at a time, so the inner loop is needed.
Thanks Rui, I'm finding it easier and faster to just unzip to a temp directory and delete after. This was very helpful, much appreciated!

Sorry, no fread(), but DuckDB combined with the zipfs community extension is also reasonably fast, even though the zipfs devs warn that it's more about convenience than high performance (see "Performance considerations" in its docs). And globbing for both the zip files and the archive contents is indeed nice.

DuckDB through duckplyr, timing stops after full materialization:

library(duckplyr)
db_exec("INSTALL zipfs FROM community")
db_exec("LOAD zipfs")

tictoc::tic()
import_dd <-
  read_csv_duckdb(
    path = "zip://temp/*.zip/*.csv", 
    options = list(union_by_name = TRUE)
  ) |> 
  data.table::setDT()
tictoc::toc()
#> 2.71 sec elapsed

import_dd
#>               date    Subset yield_a yield_b yield_c
#>             <Date>    <char>   <num>   <num>   <num>
#>      1: 1980-10-01   Subset1       4       2       3
#>      2: 1980-11-01   Subset1       8       1       3
#>      3: 1980-12-01   Subset1       3       1       9
#>      4: 1981-01-01   Subset1       1       1       2
#>      5: 1981-02-01   Subset1      17       0       9
#>     ---                                             
#> 115196: 2020-05-01 Subset240      18       0      NA
#> 115197: 2020-06-01 Subset240       3       3      NA
#> 115198: 2020-07-01 Subset240       5       3      NA
#> 115199: 2020-08-01 Subset240      15       0      NA
#> 115200: 2020-09-01 Subset240       6       4      NA

This was with default DuckDB settings; it sets the thread count to the number of CPU cores, 8 in my case (Core i7).
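If needed, the thread count can be adjusted the same way the extension was loaded, using DuckDB's standard SET statement (a sketch; 4 is an arbitrary example value):

db_exec("SET threads = 4")  # cap DuckDB at 4 threads instead of all cores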



Using the external tar command (which also unzips and is included with Windows 11 and most Linux distributions), iterate over the zip files, running tar and excluding the header rows, then read the output with fread. This is for Windows 11; on Linux, use the commented-out lines instead of the 2 lines preceding them.

library(data.table)

p <- proc.time()

col.names <- c("date", "Subset", "yield_a", "yield_b", "yield_c")

tar <- Sys.which("tar")[[1]]
fmt <- paste(tar, "-xOf %s | findstr /b /v .date")
# unzip <- Sys.which("unzip")[[1]]
# fmt <- paste(unzip, "-p %s | grep -v date")  # uncomment on Linux

res <- Sys.glob("*.zip") |>
  lapply(\(x) fread(cmd = sprintf(fmt,x), col.names = col.names, fill = TRUE)) |>
  rbindlist()

proc.time() - p
##    user  system elapsed 
##    0.02    0.14    0.85 

Old

Using the external 7z program, this takes 7 seconds on a Windows 11 system (i7 processor). It does not create any intermediate files. It first reads and concatenates all the files in each zip file, then splits out the header lines, keeping only the longest, which it puts up front, and then reads that text as a CSV.

library(magrittr)

p <- proc.time()

dat <- zips %>%
  lapply(function(file) shell(paste("7z x -so", file), intern = TRUE)) %>%
  unlist %>%
  split(!grepl("date", .)) %>%
  within(`FALSE` <- `FALSE`[which.max(nchar(`FALSE`))]) %>%
  unlist %>%
  read.csv(text = .)

proc.time() - p
##    user  system elapsed 
##    0.87    4.80    7.02 

Using gawk we can process this in under a second on my PC. This is for Windows, but it would not be hard to modify it for Linux.

dir/b *.zip | gawk -e "{ system(\"7z.exe x -so \" $0 ) }" > all.csv  
findstr /b .date all.csv | sort /r | gawk "NR==1" > final.csv
findstr /b "[0-9]" all.csv >> final.csv

1 Comment

Revised to use tar.
