The function below, used after sorting within the grouping variable grp, is intended to provide a cumulative share that can be used for quantile measurement. Its rather odd structure is because all of these variables are about 6 million lines long, and every time I copy another variable and hold it in memory it increases the chance that my analysis will crash, so I try not to hold more than two variables in memory at any one time. (testX() is just my little object-testing program: it does an str, a summary, etc.)

popWt <-        c(1,2,3,1,2,3,4)
year  <- factor(c(1,1,1,2,2,2,2))

So the desired outcome from the data above after unlisting is roughly:

0.166666667 0.5 1   0.1 0.3 0.6 1

cumPopShare.L produces cumulative shares of the population, for groups defined by a factor (grp), and with an optional logical vector to select sub-samples prior to cumulation. Often results are most meaningful if population is sorted prior to cumulation.

 cumPopShare.L <- function(pop, select.L=NULL, grp){
   if (!is.null(select.L)) {pop <- pop * select.L}    
   groups   <- split(pop, grp)   
   gLengths <-  lapply(groups, FUN=length)   
   gSums    <- lapply(groups, FUN=sum)  
   function(groups, gLengths, gSums)            
   out.L <- list(numeric()) 
   str(gLengths[1])   
   out.L[[1]]  <-list(numeric(length=as.numeric(gLengths[1])))   
   testX(out.L)
   for (i in length(groups)){
       str(gLengths[i])
       testX(out.L)
       out.L[[i]] <- rep_len(1/gSums[[i]], length.out=gLengths[[i]]) *
           cumsum(groups[[i]])  
     }   
     out.L 
 }          
cumPopShare.V <- unlist(cumPopShare.L(pop=popWt, grp=year), use.names=FALSE)

I am getting several slightly different versions of this error:

List of 1
 $ 1: int 3

>Error in out.L[[1]] <- list(numeric(length = as.numeric(gLengths[1]))) : 
>  object 'out.L' not found

This error is from the second appearance of out.L, but when I put a summary or str in after the first, it also denied that out.L exists.

I find this puzzling because in both cases I am trying to assign something to elements of the out.L variable with [[<-. I have tested these assignments at the command line level, and both of them work fine, so I am guessing that this is a scoping issue. But I've been bashing my head against it for hours, and all I have gotten is a sore head.

This is R 3.0.2, running under RStudio, on a cranky old Windows XP machine.

Any help or suggestions would be much appreciated.

Peace, andrewH

1 Answer

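I think I've got it now. The error comes from the stray function(groups, gLengths, gSums) line, which has no {} body: R takes the next expression, out.L <- list(numeric()), as that function's body, so out.L is never created inside cumPopShare.L. I also removed the other leftover debugging code (the str() and testX() calls) and changed for (i in length(groups)), which visits only the last index, to seq_along(groups). Try this: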

 cumPopShare.L <- function(pop, select.L=NULL, grp){
   if (!is.null(select.L)) {pop <- pop * select.L}    
   groups   <- split(pop, grp)   
   gLengths <-  lapply(groups, FUN=length)   
   gSums    <- lapply(groups, FUN=sum)            
   out.L <- list(numeric())  
   for (i in seq_along(groups)){
       out.L[[i]] <- rep_len(1/gSums[[i]], length.out=gLengths[[i]]) * cumsum(groups[[i]])  
     }   
     return(out.L) 
 }        

Which returns:

unlist(cumPopShare.L(pop=popWt, grp=year), use.names=FALSE)
[1] 0.1666667 0.5000000 1.0000000 0.1000000 0.3000000 0.6000000 1.0000000
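
To see the two problems in isolation, here is a minimal sketch. The names demo and g are hypothetical and exist only for illustration; they are not part of the original code.

```r
# Hypothetical demo function: a function header with no body on its own line.
demo <- function() {
  function(a, b)   # no body here, so the parser keeps reading...
  x <- 1           # ...and this assignment becomes the body of the closure above
  # The closure is created and discarded; x was never assigned in demo's frame.
  exists("x", envir = environment(), inherits = FALSE)
}
found_x <- demo()

# Hypothetical two-group list, to compare the two loop headers.
g <- list(a = 1:3, b = 1:4)
visited_wrong <- integer(0)
for (i in length(g)) visited_wrong <- c(visited_wrong, i)     # iterates over the single value 2
visited_right <- integer(0)
for (i in seq_along(g)) visited_right <- c(visited_right, i)  # iterates over 1 and 2

print(found_x)
print(visited_wrong)
print(visited_right)
```

Here found_x is FALSE, which matches the "object 'out.L' not found" error in the question, and the length(g) loop only ever touches the last group.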

By the way, if you have 6 million lines, you should probably avoid for loops. I am not an expert in this (others should confirm) and it is outside the scope of the question, but I think the apply family is faster. For even faster results, learn to use the data.table and plyr packages. The function with an lapply loop would be something like:

cumPopShare.L <- function(pop, select.L=NULL, grp){
   if (!is.null(select.L)) {pop <- pop * select.L}    
   groups   <- split(pop, grp)   
   gLengths <-  lapply(groups, FUN=length)   
   gSums    <- lapply(groups, FUN=sum)            
   out.L <- lapply(seq_along(groups), function(i) rep_len(1/gSums[[i]], length.out=gLengths[[i]]) * cumsum(groups[[i]])) 
  return(out.L)} 

Check whether that is faster.
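
As a possible alternative (a sketch using base R's ave(), not something I have benchmarked on 6 million rows), the grouped cumulative share can be computed without building the intermediate lists at all. The name cumShare is just an illustrative choice; popWt and year are the sample vectors from the question.

```r
popWt <- c(1, 2, 3, 1, 2, 3, 4)
year  <- factor(c(1, 1, 1, 2, 2, 2, 2))

# ave() applies FUN within each level of the grouping factor and
# returns a plain vector of the same length as the input.
cumShare <- ave(popWt, year, FUN = function(x) cumsum(x) / sum(x))
print(cumShare)
```

One difference to keep in mind: ave() returns values in the original row order, whereas unlist(split(...)) returns them ordered by group, so the two only line up directly when the data are already sorted by the grouping variable.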

1 Comment

That is a thing of beauty. It would never have occurred to me to use the list (X) term in lapply to hold a simple index. The reason I was using a for loop at all was that I couldn't get apply functions to keep the grouped pieces of several vectors aligned unless I bound them together in X. I've extracted just the variables I'm using from a 14 gig file to a data frame, but even so, it is big enough that my memory will not hold two copies of it, or it plus copies of more than about 1/3 of the variables. The X=i trick lets you do away with the loop without making copies of the vectors in X. Excellent!
