In a university class, I need to work with a pretty big longitudinal data set: the .rds file is around 300 MB, with 380,000 observations of 5,160 variables in total. The data set goes back to 1984, but I only need the years from 2012 onward. So in order to make things easier and more manageable, I want to load the whole data set once, then use the filter function to get rid of all the years before 2012, then discard all the variables I don't need with the select function, and save the whole thing into a new, much smaller, more manageable .rds file.
This is my code so far:
library(tidyr)
setwd("F:/data")
pl <- readRDS("pl.rds")
pl <- pl %>% filter(syear >= 2012)
saveRDS(pl, file = "pl_2012.rds")
Loading the data set pl does actually work on my desktop PC (on my laptop, I can't even do that), but when I try to use filter I get: "Error: cannot allocate vector of size 14.5 Gb".
I know this means there isn't enough memory for the operation. However, I don't understand why it happens here. The filter call should trim the object down and get rid of all the years I don't need, so the object in the workspace should get significantly smaller. I purposely assigned the result back to pl itself to reduce its size rather than create an additional object that takes up more memory. So why do I still get this error, and more importantly, what can I do to fix it? Of course, I have already closed every other non-essential task and application in the background to free up as much RAM as possible. Is there anything else I can do? I have 16 GB of RAM, other people in my class have 16 GB as well, and for them the same method works just fine, so there must be a way.
Comment: The filter function you're likely trying to use is in the dplyr package (not tidyr), and it doesn't appear you've loaded dplyr. That means when you call filter you may in fact be calling the base R function of the same name, which attempts to run a linear filtering algorithm on a time series and may indeed run out of memory on large data.

Reply (OP): You're right, I meant filter from the dplyr package and I forgot to load it. However, even after loading dplyr, I still get the error message. Now it's just "cannot allocate vector of size 785 Kb" instead of 14.5 Gb, but it still doesn't work.

Comment: The data.table package does more operations in-place and might be slightly more useful in this situation ...?
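For reference, a minimal sketch of the corrected dplyr workflow described in the comments above. The column names passed to select are placeholders; substitute whichever of the 5,160 variables you actually need. Dropping the unneeded columns in the same step as the row filter keeps the intermediate copies that dplyr allocates much smaller.

library(dplyr)

pl <- readRDS("F:/data/pl.rds")

# Keep only the needed columns (placeholder names), then only the needed years.
pl <- pl %>%
  select(pid, syear, some_variable) %>%
  filter(syear >= 2012)

saveRDS(pl, file = "F:/data/pl_2012.rds")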
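And a sketch of the data.table route from the last comment, under the same assumption that the year column is called syear: setDT() converts the data frame to a data.table by reference, so no extra full copy of the object is made before subsetting.

library(data.table)

pl <- readRDS("F:/data/pl.rds")
setDT(pl)                 # converts pl by reference, no copy
pl <- pl[syear >= 2012]   # row subset on the year column
saveRDS(pl, file = "F:/data/pl_2012.rds")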