I'm working with 50 Parquet files (each ~800 MB, with ~380,000 rows and ~8 columns). I need to perform a grouped summarisation in R, something like:
group_by(sample_id, gene1, gene2) %>%
  summarise(mean_importance = mean(importance),
            mean_count = mean(n_count)) %>%
  pivot_wider(names_from = "sample_id",
              values_from = c("mean_importance", "mean_count"),
              names_sep = "__")
Here, pivot_wider() is not supported by arrow, so just before it I need to collect() the data into a data frame and then apply pivot_wider(). As soon as I call collect(), I run into memory errors (core dumped, bad_alloc). What is the best way to handle this much data without running out of memory? The experimental batch processing seemed like an option, but I will not be able to make batches by random subsetting; ideally the batches would be defined by the group_by columns.
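One idea along those lines (sketch only, not verified on the full data; the paths and the example gene name are placeholders) would be to rewrite the dataset partitioned by one of the grouping columns with write_dataset(), so that batches correspond to partitions rather than random subsets:

library(arrow)
library(dplyr)

# Rewrite the 50 files as a dataset partitioned by gene1 (one directory per gene value).
open_dataset("path") %>%
  write_dataset("path_partitioned", format = "parquet", partitioning = "gene1")

ds_part <- open_dataset("path_partitioned")

# Filtering on the partition column should only read that partition's files,
# so each batch touches a subset of the data rather than all 50 files.
one_batch <- ds_part %>%
  filter(gene1 == "GENE_A") %>%            # placeholder gene name
  group_by(sample_id, gene1, gene2) %>%
  summarise(mean_importance = mean(importance),
            mean_count = mean(n_count)) %>%
  collect()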
I was trying out listing all the possible groups and then processing them with mclapply:
pq_files <- list.files("path", full.names = TRUE)
pq_files <- open_dataset(sources = pq_files)
grp_list <- expand_grid("gene1" = gene1,
"gene2" = gene2) %>%
filter(gene1 != gene2)
res <- mclapply(X = 1:nrow(grp_list),
mc.cores = 60,
FUN = function(i){
pq_files %>%
filter((gene1 == grp_list$gene1[i]) & (gene1 == grp_list$gene1[i])) %>%
group_by(sample_id, gene1, gene2) %>%
summarise(mean_importance = mean(importance),
mean_count = mean(n_count)) %>%
collect() %>%
pivot_wider(names_from = "sample_id",
values_from = c("mean_importance", "mean_count"),
names_sep = "__")
})
But I suppose this means the files on disk are scanned repeatedly, once per group. Is there a more efficient way to do this? Would using other file formats or packages help improve performance?
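For instance, would handing the arrow dataset over to duckdb be worth testing? duckdb can spill its aggregation to disk under a memory limit, so only the aggregated rows would need to come back into R before pivot_wider(). A rough sketch (the memory limit, temp directory and path are placeholders I have not tuned):

library(arrow)
library(dplyr)
library(tidyr)
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "SET memory_limit = '32GB'")               # placeholder: tune to the machine
dbExecute(con, "SET temp_directory = '/tmp/duck_spill'")  # lets the aggregation spill to disk

res <- open_dataset("path") %>%
  to_duckdb(con = con) %>%          # hand the dataset to duckdb without copying it into R
  group_by(sample_id, gene1, gene2) %>%
  summarise(mean_importance = mean(importance),
            mean_count = mean(n_count)) %>%
  collect() %>%                     # only the aggregated rows are materialised in R
  pivot_wider(names_from = "sample_id",
              values_from = c("mean_importance", "mean_count"),
              names_sep = "__")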
Note: Query cross-posted in GitHub/arrow.
sample_id? Are you after a dplyr-like pipeline (arrow / dbplyr / duckplyr) or would duckdb + SQL be OK? Anyway, a small reproducible example dataset would certainly help others to help you. pivot_wider.tbl_lazy() collects data and thus triggers early materialization in duckdb. From the linked doc: "Note that pivot_wider() is not and cannot be lazy because we need to look at the data to figure out what the new column names will be. If you have a long running query you have two options ..." A small dataset would do just fine for testing strategies and checking the generated dbplyr SQL queries & duckdb execution plans to assess if those might scale.
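For instance, with a small toy table (all names here are illustrative), the generated SQL and the duckdb execution plan can be inspected before anything is collected:

library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())
dbWriteTable(con, "toy", data.frame(sample_id = c("s1", "s2"),
                                    gene1 = "A", gene2 = "B",
                                    importance = c(0.1, 0.2),
                                    n_count = c(10L, 20L)))

lazy_q <- tbl(con, "toy") %>%
  group_by(sample_id, gene1, gene2) %>%
  summarise(mean_importance = mean(importance),
            mean_count = mean(n_count))

show_query(lazy_q)   # the dbplyr-generated SQL
explain(lazy_q)      # duckdb's execution plan for that query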