
I have a large raw vector, e.g.:

x <- rep(as.raw(1:10), 4e8) # this vector is about 4 GB

I just want to remove the first element, but no matter what I do it uses a huge amount of memory.

> x <- tail(x, length(x)-1)
Error: cannot allocate vector of size 29.8 Gb
> x <- x[-1L]
Error: cannot allocate vector of size 29.8 Gb
> x <- x[seq(2, length(x)-1)]
Error: cannot allocate vector of size 29.8 Gb

What's going on? Do I really have to rely on C to do such a simple operation? (I know it's simple to do with Rcpp but that's not the point).

SessionInfo:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_0.8.3

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 compiler_3.6.1   magrittr_1.5     assertthat_0.2.1
 [5] R6_2.4.0         pillar_1.4.2     glue_1.3.1       tibble_2.1.3
 [9] crayon_1.3.4     Rcpp_1.0.2       pkgconfig_2.0.2  rlang_0.4.0
[13] purrr_0.3.2

Rcpp solution, as @jangorecki asked for:

#include <Rcpp.h>
using namespace Rcpp;

// solution for the original question
// [[Rcpp::export]]
IntegerVector popBeginningOfVector(IntegerVector x, int npop) {
  return IntegerVector(x.begin() + npop, x.end());
}

// generic negative indexing (drops the given 1-based indices)
// [[Rcpp::export]]
IntegerVector efficientNegativeIndexing(IntegerVector x, IntegerVector neg_idx) {
  size_t xsize = x.size();
  size_t ni_size = neg_idx.size();
  if(ni_size == 0) return clone(x);          // nothing to drop
  std::sort(neg_idx.begin(), neg_idx.end()); // note: sorts the caller's vector in place
  int * xptr = INTEGER(x);
  IntegerVector xt(xsize - ni_size); // allocate new vector of the correct size
  int * xtptr = INTEGER(xt);
  size_t xtposition = 0;
  int range_begin, range_end = 0;
  for(size_t i = 0; i < ni_size; ++i) {
    range_begin = (i == 0) ? 0 : neg_idx[i - 1]; // element just after the previous dropped index
    range_end = neg_idx[i] - 1;                  // element just before the current dropped index
    std::copy(xptr + range_begin, xptr + range_end, xtptr + xtposition);
    xtposition += range_end - range_begin;
  }
  std::copy(xptr + range_end + 1, xptr + xsize, xtptr + xtposition); // tail after the last dropped index
  return xt;
}
Comments:

  • On my machine, I get the exact same error messages for the 1st and 3rd approaches, but for x <- x[-1L] my error message says "14.9 Gb" (half the memory). Commented Aug 3, 2019 at 1:31
  • @Shree I don't see any solution there. Commented Aug 3, 2019 at 1:40
  • @Grada Gukovic, did you use backticks around [ or quotes? When I enter is.primitive(`[`) using backticks, R returns TRUE. Commented Aug 3, 2019 at 10:46
  • @Joe: The source to the base method for [ is in svn.r-project.org/R/trunk/src/main/subset.c, function do_subset. Most of the work for your case would happen in VectorSubset in the same file. Commented Aug 3, 2019 at 12:38
  • @jangorecki Sure, a pointer is just an integer -- but the underlying memory would still be allocated, so you could have something like a memory leak if you treat the pointer inappropriately. It would also add overhead, as you'd need to keep track of both the original pointer (in order to de-allocate) and the shifted pointer. Commented Aug 6, 2019 at 19:27

1 Answer

The problem is that the code to do subsetting allocates a vector of the indices corresponding to the elements you want. For your example, that's the vector 2:4e9.

Recent versions of R can store such sequences very compactly via ALTREP (keeping just the first and last elements), but the code doing the subsetting doesn't take advantage of that, so it needs to materialize all 4e9 - 1 index values.

Integers would use 4 bytes each, but 4e9 exceeds the maximum integer (.Machine$integer.max, about 2.1e9), so R stores all those index values as 8-byte doubles. That adds up to 32000000040 bytes according to pryr::object_size(2:4e9), which is the 29.8 Gb in your error messages.

To get around this, you would need to make very low level changes to the subsetting code in https://svn.r-project.org/R/trunk/src/main/subset.c and the subscripting code in https://svn.r-project.org/R/trunk/src/main/subscript.c.

Since this is such a specialized case and the alternative (doing it all in C or C++) is so much easier, I don't think R Core is going to put a lot of effort into this.


8 Comments

Thanks for the explanation. I don't think this is really such a specialized case, since any large vector would suffer the same issues. Hopefully R Core would fix issues like this -- not everyone using R knows Rcpp.
If you want it fixed, think about how to put together a patch. One approach would involve handling large subsets like yours in batches, instead of trying to do them all at once. I don't know how complicated that would end up being. It's probably easier to just buy a ton of memory, so 30 Gb allocations are no big deal.
I can see why it might be an issue for positive indexing, but for negative indexing like x[-1] there's no reason R needs to allocate such large amounts of memory. And expecting users to have 34+ GB of RAM isn't realistic, c'mon. Other languages, e.g. Python, don't have this issue.
@Hugh: If you want the answer to a different question, you should post that question separately.
@Hugh I am not sure it is possible to exclude the first element in place. Is it possible to shift a data pointer? If it were the last element, setting SET_TRUELENGTH would work.
