
In a university class, I need to work with a pretty big longitudinal data set: the .rds file is around 300 MB, in total 380,000 observations of 5,160 variables. The data set goes back to 1984, but I only need the years from 2012 onward. So, to make things easier and more manageable, I want to load the whole data set once, use the filter function to get rid of all the years before 2012, then discard all the variables I don't need with the select function, and save the whole thing into a new, much smaller .rds file.

This is my code so far:

library(tidyr)
setwd("F:/data")
pl <- readRDS("pl.rds")

pl <- pl %>% filter (syear > 2012)
saveRDS(pl, file = "pl_2012.rds")

Loading the data set pl does actually work on my desktop PC (on my laptop, I can't even do that), but when I try to use filter I get: "Error: cannot allocate vector of size 14.5 Gb".

I know this means that there's not enough memory for the operation. However, I don't understand why I get it here. The filter function should trim down the object and get rid of all the years I don't need, so the object in the workspace should get significantly smaller. I purposely applied it to pl itself, to reduce its size rather than create an additional object that takes up more memory. So why do I still get this error, and more importantly, what can I do to fix it? Of course, I have already closed every other non-essential task and application in the background to free up as much RAM as possible. Is there anything else I can do? I have 16 GB of RAM, other people in my class have 16 GB as well, and for them the same method works just fine, so there must be a way.

  • The filter function you're likely trying to use is in the dplyr package (not tidyr), and it doesn't appear you've loaded dplyr. That means when you call filter you may in fact be calling the base R function of the same name, which attempts to run a linear filtering algorithm on a time series and may indeed run out of memory on large data (a quick check for this is sketched below the comments). Commented Dec 17, 2023 at 23:03
  • @Joran I seriously feel like the most stupid person on earth right now... of course, I wanted to use filter from the dplyr package and I forgot to load it. However, even after loading dplyr, I still get the error message. Now it's just a vector of size 785 Kb and not 14.5 Gb, but it still doesn't work. Commented Dec 18, 2023 at 11:05
  • The data.table package does more operations in-place and might be slightly more useful in this situation ... ? Commented Dec 18, 2023 at 16:30
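
A quick way to confirm which filter() a call resolves to (a minimal sketch, assuming only the default packages plus dplyr):

library(dplyr)        # once attached, dplyr::filter masks stats::filter
environment(filter)   # should print <environment: namespace:dplyr>
# Without library(dplyr), filter() falls through to stats::filter, a
# time-series linear filter, which explains the huge allocation attempt
# on a large data frame.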

1 Answer


For working with large datasets, the arrow package might provide a solution. See the arrow documentation for some examples.

But in the case of your code you could use:

library(dplyr)
library(arrow)

setwd("F:/data")
pl <- readRDS("pl.rds")

# define folder to store partitioned data file
dataset_path <- file.path(getwd(), "subset")
if(!dir.exists(dataset_path)) dir.create(dataset_path)

# break the data up into smaller files on disk, partitioned by syear
pl %>%
  group_by(syear) %>%
  write_dataset(dataset_path)

# remove the full data frame from memory and trigger garbage collection
rm(pl)
gc()

# check
list.files(dataset_path, recursive = TRUE)

# make connection to data
dset <- open_dataset(dataset_path)

# do lazy loading and processing, example filtering
pl <- dset %>%
  filter(syear > 2012) %>%
  collect()

And you can use this not only to filter, but to do all kinds of operations without needing the full dataset in memory.
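
For example, here is a minimal sketch of a lazy query on the same connection; the column name some_var is a placeholder, not a column from the original data:

# nothing is read into memory until collect()
dset %>%
  filter(syear > 2012) %>%
  select(syear, some_var) %>%   # some_var is a hypothetical column
  group_by(syear) %>%
  summarise(n = n()) %>%
  collect()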
