
I have a large raw vector, e.g.:

x <- rep(as.raw(1:10), 4e8) # this vector is about 4 GB

I just want to remove the first element, but no matter what I do it uses a huge amount of memory.

> x <- tail(x, length(x)-1)
Error: cannot allocate vector of size 29.8 Gb
> x <- x[-1L]
Error: cannot allocate vector of size 29.8 Gb
> x <- x[seq(2, length(x)-1)]
Error: cannot allocate vector of size 29.8 Gb

What's going on? Do I really have to rely on C to do such a simple operation? (I know it's simple to do with Rcpp but that's not the point).

SessionInfo:

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_0.8.3

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.5 compiler_3.6.1   magrittr_1.5     assertthat_0.2.1
 [5] R6_2.4.0         pillar_1.4.2     glue_1.3.1       tibble_2.1.3
 [9] crayon_1.3.4     Rcpp_1.0.2       pkgconfig_2.0.2  rlang_0.4.0
[13] purrr_0.3.2

Rcpp solution, as @jangorecki asked for:

#include <Rcpp.h>
using namespace Rcpp;

// solution for the original question
// [[Rcpp::export]]
IntegerVector popBeginningOfVector(IntegerVector x, int npop) {
  return IntegerVector(x.begin() + npop, x.end());
}

// generic negative indexing (drops the given 1-based indices)
// [[Rcpp::export]]
IntegerVector efficientNegativeIndexing(IntegerVector x, IntegerVector neg_idx) {
  size_t xsize = x.size();
  size_t ni_size = neg_idx.size();
  if(ni_size == 0) return clone(x);          // nothing to drop
  std::sort(neg_idx.begin(), neg_idx.end()); // note: sorts the caller's vector in place
  int * xptr = INTEGER(x);
  IntegerVector xt(xsize - ni_size); // allocate new vector of the correct size
  int * xtptr = INTEGER(xt);
  size_t xtposition = 0;
  int range_begin, range_end = 0;
  for(size_t i = 0; i < ni_size; ++i) {
    range_begin = (i == 0) ? 0 : neg_idx[i - 1]; // element just after the previous dropped index
    range_end = neg_idx[i] - 1;                  // element just before the current dropped index
    std::copy(xptr + range_begin, xptr + range_end, xtptr + xtposition);
    xtposition += range_end - range_begin;
  }
  std::copy(xptr + range_end + 1, xptr + xsize, xtptr + xtposition); // tail after the last dropped index
  return xt;
}
Comments:

  • On my machine, I get the exact same error messages for the 1st and 3rd approaches, but for x <- x[-1L] my error message says "14.9 Gb" (half the memory). Commented Aug 3, 2019 at 1:31
  • @Shree I don't see any solution there. Commented Aug 3, 2019 at 1:40
  • @Grada Gukovic, did you use backticks around [ or quotes? When I enter is.primitive(`[`) using backticks, R returns TRUE. Commented Aug 3, 2019 at 10:46
  • @Joe: The source to the base method for [ is in svn.r-project.org/R/trunk/src/main/subset.c, function do_subset. Most of the work for your case would happen in VectorSubset in the same file. Commented Aug 3, 2019 at 12:38
  • @jangorecki Sure, a pointer is just an integer -- but the underlying memory would still be allocated, so you could have something like a memory leak if you treat the pointer inappropriately. It would also add overhead, as you'd need to keep track of both the original pointer (in order to de-allocate) and the shifted pointer. Commented Aug 6, 2019 at 19:27

1 Answer

The problem is that the code to do subsetting allocates a vector of the indices corresponding to the elements you want. For your example, that's the vector 2:4e9.

Recent versions of R can store such sequences very compactly via ALTREP (keeping just the first and last elements), but the code doing the subsetting doesn't take advantage of that, so it needs to materialize all 4e9 - 1 index values.

Integers would use 4 bytes each, but 4e9 exceeds the maximum integer (.Machine$integer.max, about 2.1e9), so R stores all those index values as 8-byte doubles. That adds up to 32000000040 bytes according to pryr::object_size(2:4e9), which is the 29.8 Gb in your error messages.

To get around this, you would need to make very low level changes to the subsetting code in https://svn.r-project.org/R/trunk/src/main/subset.c and the subscripting code in https://svn.r-project.org/R/trunk/src/main/subscript.c.

Since this is such a specialized case and the alternative (doing it all in C or C++) is so much easier, I don't think R Core is going to put a lot of effort into this.


8 Comments

Thanks for the explanation. I don't think this is really such a specialized case, since any large vector would suffer the same issues. Hopefully R Core would fix issues like this -- not everyone using R knows Rcpp.
If you want it fixed, think about how to put together a patch. One approach would involve handling large subsets like yours in batches, instead of trying to do them all at once. I don't know how complicated that would end up being. It's probably easier to just buy a ton of memory, so 30 Gb allocations are no big deal.
I can see why it might be an issue for positive indexing, but for negative indexing like x[-1] there's no reason R needs to allocate such large amounts of memory. And expecting users to have 34+ GB of RAM isn't realistic, c'mon. Other languages, e.g. Python, don't have this issue.
@Hugh: If you want the answer to a different question, you should post that question separately.
@Hugh I am not sure it is possible to exclude the first element in place. Is it possible to shift a data pointer? If it were the last element, setting SET_TRUELENGTH would work.
