
In a university class, I need to work with a pretty big longitudinal data set: the .rds file is around 300 MB, in total 380,000 observations of 5,160 variables. The data set goes back to 1984, but I only need the years from 2012 onward. So, to make things easier and more manageable, I want to load the whole data set once, use the filter function to get rid of all the years before 2012, then discard all the variables I don't need with the select function, and save the whole thing into a new, much smaller .rds file.

This is my code so far:

library(tidyr)
setwd("F:/data")
pl <- readRDS("pl.rds")

pl <- pl %>% filter (syear > 2012)
saveRDS(pl, file = "pl_2012.rds")

Loading the data set pl does actually work on my desktop PC (on my laptop, I can't even do that), but when I try to use filter I get: "Error: cannot allocate vector of size 14.5 Gb".

I know this means that there's not enough memory for the operation. However, I don't understand why I get it here. The filter function should trim down the object and get rid of all the years I don't need, so the object in the workspace should get significantly smaller. I purposely applied it to pl itself, to reduce its size rather than create an additional object that takes up more memory. So why do I still get this error, and more importantly, what can I do to fix it? Of course, I have already closed every other non-essential task and application in the background to free up as much RAM as possible. Is there anything else I can do? I have 16 GB of RAM, other people in my class have 16 GB as well, and for them the same method works just fine, so there must be a way.

  • The filter function you're likely trying to use is in the dplyr package (not tidyr), and it doesn't appear you've loaded dplyr. That means when you call filter you may in fact be calling the base R function of the same name, which attempts to run a linear filtering algorithm on a time series and may indeed run out of memory on large data (a quick check for this is sketched below the comments). Commented Dec 17, 2023 at 23:03
  • @Joran I seriously feel like the most stupid person on earth right now... of course, I wanted to use filter from the dplyr package and I forgot to load it. However, even after loading dplyr, I still get the error message. Now it's just a vector of size 785 Kb and not 14.5 Gb, but it still doesn't work. Commented Dec 18, 2023 at 11:05
  • The data.table package does more operations in-place and might be slightly more useful in this situation ... ? Commented Dec 18, 2023 at 16:30
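
A quick way to confirm which filter() a call resolves to (a minimal sketch, assuming only the default packages plus dplyr):

library(dplyr)        # once attached, dplyr::filter masks stats::filter
environment(filter)   # should print <environment: namespace:dplyr>
# Without library(dplyr), filter() falls through to stats::filter, a
# time-series linear filter, which explains the huge allocation attempt
# on a large data frame.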

1 Answer


For working with large datasets, the arrow package might provide a solution. See the arrow documentation for some examples.

But in the case of your code you could use:

library(dplyr)
library(arrow)

setwd("F:/data")
pl <- readRDS("pl.rds")

# define folder to store partitioned data file
dataset_path <- file.path(getwd(), "subset")
if(!dir.exists(dataset_path)) dir.create(dataset_path)

# break the data up into smaller files on disk, partitioned by syear
pl %>%
  group_by(syear) %>%
  write_dataset(dataset_path)

# remove the full data frame from memory and trigger garbage collection
rm(pl)
gc()

# check
list.files(dataset_path, recursive = TRUE)

# make connection to data
dset <- open_dataset(dataset_path)

# do lazy loading and processing, example filtering
pl <- dset %>%
  filter(syear > 2012) %>%
  collect()

And you can use this not only to filter, but to do all kinds of operations without needing the full dataset in memory.
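
For example, here is a minimal sketch of a lazy query on the same connection; the column name some_var is a placeholder, not a column from the original data:

# nothing is read into memory until collect()
dset %>%
  filter(syear > 2012) %>%
  select(syear, some_var) %>%   # some_var is a hypothetical column
  group_by(syear) %>%
  summarise(n = n()) %>%
  collect()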
