I have a data.table of factor columns, and I want to pull out the label of the last non-missing value in each row. It's kind of a typical max.col situation, but I don't want to coerce needlessly, since I'm trying to optimize this code using data.table. The real data has other types of columns as well.

Here is the example:

## Some sample data
set.seed(0)
dat <- sapply(split(letters[1:25], rep.int(1:5, 5)), sample, size=8, replace=TRUE)
dat[upper.tri(dat)] <- NA
dat[4:5, 4:5] <- NA                              # the real data isn't nice and upper-triangular
dat <- data.frame(dat, stringsAsFactors = TRUE)  # factor columns

## So, it looks like this
setDT(dat)[]
#    X1 X2 X3 X4 X5
# 1:  u NA NA NA NA
# 2:  f  q NA NA NA
# 3:  f  b  w NA NA
# 4:  k  g  h NA NA
# 5:  u  b  r NA NA
# 6:  f  q  w  x  t
# 7:  u  g  h  i  e
# 8:  u  q  r  n  t

## I just want to get the labels of the factors
## that are 'rightmost' in each row.  I tried a number of things 
## that probably don't make sense here.
## This just about gets the column index
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)]
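
For what it's worth, that index can be chained back into .SD to pull out the label; this is only a rough sketch, and since it groups row by row I expect it to be slow on the real data:

## Rough sketch only: use the per-row count of non-missing values as a
## column index into .SD (row-by-row grouping, so this will be slow)
dat[, as.character(.SD[[colInd]]), by = 1:nrow(dat), .SDcols = paste0("X", 1:5)]$V1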

This is the goal, though: to extract these labels, here using regular base functions.

## Using max.col and a data.frame
df1 <- as.data.frame(dat)
inds <- max.col(is.na(as.matrix(df1)), ties="first")-1
inds[inds==0] <- ncol(df1)
df1[cbind(1:nrow(df1), inds)]
# [1] "u" "q" "w" "h" "r" "t" "e" "t"

5 Answers

Here's another way:

## Initialise the result, then walk the columns right to left,
## filling only the rows whose result is still missing:
dat[, res := NA_character_]
for (v in rev(names(dat))[-1]) dat[is.na(res), res := get(v)]


   X1 X2 X3 X4 X5 res
1:  u NA NA NA NA   u
2:  f  q NA NA NA   q
3:  f  b  w NA NA   w
4:  k  g  h NA NA   h
5:  u  b  r NA NA   r
6:  f  q  w  x  t   t
7:  u  g  h  i  e   e
8:  u  q  r  n  t   t

Benchmarks

Using the same data as @alexis_laz and making (apparently) superficial changes to the functions, I see different results. Just showing them here in case anyone is curious. Alexis' answer (with small modifications) still comes out ahead.

Functions:

alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]]))){
    ## recurse from the last column backwards, filling only the rows ('wh')
    ## whose answer is still missing
    if(!length(wh)) return(ans)
    ans[wh] = as.character(x[[length(x)]])[wh]
    Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}

alex2 = function(x){
    ## same idea as 'alex', written as a loop that fills a data.table column with set()
    x[, res := NA_character_]
    wh = x[, .I]
    for (v in (length(x)-1):1){
      if (!length(wh)) break
      set(x, j="res", i=wh, v = x[[v]][wh])
      wh = wh[is.na(x$res[wh])]
    }
    x$res
}

frank = function(x){
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
    return(x$res)       
}

frank2 = function(x){
    ## same as 'frank', except the assignment comes from .SD/.SDcols rather than get()
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := .SD, .SDcols=v]
    x$res
}

Example data and benchmark:

## 100 columns of 3e5 values, with a growing run of leading NAs in each column
DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)), 
                     function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
DAT3 = as.list(copy(DAT1))
DAT4 = copy(DAT1)

library(microbenchmark)
microbenchmark(frank(DAT1), frank2(DAT2), alex(DAT3), alex2(DAT4), times = 30)

Unit: milliseconds
         expr       min        lq      mean    median         uq        max neval
  frank(DAT1) 850.05980 909.28314 985.71700 979.84230 1023.57049 1183.37898    30
 frank2(DAT2)  88.68229  93.40476 118.27959 107.69190  121.60257  346.48264    30
   alex(DAT3)  98.56861 109.36653 131.21195 131.20760  149.99347  183.43918    30
  alex2(DAT4)  26.14104  26.45840  30.79294  26.67951   31.24136   50.66723    30

Another idea, similar to Frank's, that tries (1) to avoid subsetting 'data.table' rows (which I assume must have some cost) and (2) to avoid checking a vector of length nrow(dat) for NAs in every iteration.

alex = function(x, ans = rep_len(NA, length(x[[1L]])), wh = seq_len(length(x[[1L]])))
{
    if(!length(wh)) return(ans)
    ans[wh] = as.character(x[[length(x)]])[wh]
    Recall(x[-length(x)], ans, wh[is.na(ans[wh])])
}   
alex(as.list(dat)) #had some trouble with 'data.table' subsetting
# [1] "u" "q" "w" "h" "r" "t" "e" "t"

And to compare with Frank's:

frank = function(x)
{
    x[, res := NA_character_]
    for(v in rev(names(x))[-1]) x[is.na(res), res := get(v)]
    return(x$res)       
}

DAT1 = as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)), 
                     function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
DAT2 = copy(DAT1)
microbenchmark::microbenchmark(alex(as.list(DAT1)), 
                               { frank(DAT2); DAT2[, res := NULL] }, 
                               times = 30)
#Unit: milliseconds
#                                            expr       min        lq    median        uq       max neval
#                             alex(as.list(DAT1))  102.9767  108.5134  117.6595  133.1849  166.9594    30
# {     frank(DAT2)     DAT2[, `:=`(res, NULL)] } 1413.3296 1455.1553 1497.3517 1540.8705 1685.0589    30
identical(alex(as.list(DAT1)), frank(DAT2))
#[1] TRUE

Comments

Yeah, I got my idea from one of your earlier posts. I wonder how it compares against dat[, colInd := Reduce(function(x,y) x+!is.na(y), .SD, init=0L)][, res := as.character(.SD[[.BY[[1]]]]), by=colInd]. For few cols and many rows, I think this way might be pretty good. Also, the OP's max.col approach would be interesting to see.
@Frank: With a rough benchmark, Reduce... is indeed faster than your first approach, but, I guess, reading each column three times (for +, ! and is.na) adds some time. I didn't add the max.col one, because microbenchmark(as.matrix(DAT1)) seems slow enough to begin with.
@TheTime: Did you use a "data.table" in the recursive function? I had some trouble with 'data.table' subsetting and used as.list.data.table first.
I was having the same issue as TheTime, but as.list resolved it, yeah.
Added another benchmark with your idea but in a loop with set; it's somewhat faster.

Here is a one-liner base R approach:

sapply(split(dat, seq(nrow(dat))), function(x) tail(x[!is.na(x)],1))
#  1   2   3   4   5   6   7   8 
#"u" "q" "w" "h" "r" "t" "e" "t" 


We convert the 'data.frame' to a 'data.table' and create a row id column (setDT(df1, keep.rownames=TRUE)). We reshape from 'wide' to 'long' format with melt. Grouped by 'rn', if there is no NA element in the 'value' column, we take its last element (value[.N]); otherwise we take the element just before the first NA. This gives the 'V1' column, which we extract ($V1).

melt(setDT(df1, keep.rownames=TRUE), id.var='rn')[,
     if(!any(is.na(value))) value[.N] 
     else value[which(is.na(value))[1]-1], by =  rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"

In case the data is already a data.table:

dat[, rn := 1:.N]#create the 'rn' column
melt(dat, id.var='rn')[, #melt from wide to long format
     if(!any(is.na(value))) value[.N] 
     else value[which(is.na(value))[1]-1], by =  rn]$V1
#[1] "u" "q" "w" "h" "r" "t" "e" "t"

Here is another option:

## Count the non-NA values in each row ('colInd'), then group rows by that
## count and pick out the colInd-th column via .BY[[1]]
dat[, colInd := sum(!is.na(.SD)), by=1:nrow(dat)][
   , as.character(.SD[[.BY[[1]]]]), by=colInd]

Or, as @Frank mentioned in the comments, we can use na.rm=TRUE in melt and make it more compact:

 melt(dat[, r := .I], id="r", na.rm=TRUE)[, value[.N], by=r]

Comments

@TheTime Yes, you can do it like that, but if we have to convert from data.frame to data.table, the options in setDT will be handy.
@TheTime Sorry for that, I added some explanations. The value is from the default column names after the melt step.
Here's something ridiculous I came up with. I doubt it's worthy of an answer though: dat[, do.call(Map, c(function(...) tail(c(...)[!is.na(c(...))],1), lapply(dat,as.character)) )]
You can drop NAs in the melt: melt(dat[, r := .I], id="r", na.rm=TRUE)[, value[.N], by=r]
@TheTime Your .BY option is probably slow because you do a by-row operation before it. Instead... dat[, colInd := Reduce(function(x,y) x+!is.na(y), .SD, init=0L)][, res := as.character(.SD[[.BY[[1]]]]), by=colInd] (Not sure if you want to change it.)

I'm not sure how to improve upon @alexis's answer beyond what @Frank has already done, but your original approach with base R wasn't too far off something reasonably performant.

Here's a variant of your approach that I liked because (1) it's reasonably quick and (2) it doesn't require too much thought to figure out what's going on:

as.matrix(dat)[cbind(1:nrow(dat), max.col(!is.na(dat), "last"))] 

The most expensive part seems to be the as.matrix(dat) call, but otherwise it seems to be faster than the melt approach that @akrun shared.
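
If you want to check that comparison yourself, a rough sketch along these lines should do, with data built the same way as DAT1 in the benchmarks above (timings omitted here, since they will vary with the shape of the data):

## Rough benchmark sketch; DATm is constructed like DAT1 in the earlier answers
library(data.table)
library(microbenchmark)
DATm <- as.data.table(lapply(ceiling(seq(0, 1e4, length.out = 1e2)),
                      function(n) c(rep(NA, n), sample(letters, 3e5 - n, TRUE))))
microbenchmark(
  maxcol = as.matrix(DATm)[cbind(1:nrow(DATm), max.col(!is.na(DATm), "last"))],
  melt   = melt(copy(DATm)[, r := .I], id = "r", na.rm = TRUE)[, value[.N], by = r]$V1,
  times = 10
)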
