13

I have a data.table with columns of different data types. My goal is to select only the numeric columns and replace the NA values in those columns with 0. I know that NA values can be replaced with zero like this:

DT[is.na(DT)] <- 0

To select only numeric columns, I found this solution, which works fine:

DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]

I can achieve what I want by assigning

DT2 <- DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]

and then do:

DT2[is.na(DT2)] <- 0

But of course I would like to have my original DT modified by reference. With the following, however:

DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE]
                 [is.na(DT[, as.numeric(which(sapply(DT,is.numeric))), with = FALSE])]<- 0

I get

"Error in [.data.table([...] i is invalid type (matrix)"

What am I missing? Any help is much appreciated!!

4
  • You are missing the basic syntax of data.tables, which don't do DT[...] <- y. Try reading the vignettes github.com/Rdatatable/data.table/wiki/Getting-started It's a more efficient way to learn than "finding solutions" for each step you think you need to take. The answer below doesn't even require the with=FALSE trick you found. Commented May 23, 2016 at 13:09
  • Thanks for the advice. Could you please elaborate on the basic syntax error "...which don't do DT[...] <- y". What does that mean? Why does the assignment work in one case and not in the other? I could not find anything in the vignettes, and it would still help me a lot to understand. Commented May 23, 2016 at 13:42
  • Data tables shouldn't be used like DT[...] <- y where ... is whatever you have in mind. Assignment is done with := or set not with a <-. The arrow way actually does work in special cases, in the sense that the table is modified, but it does not work by reference (last I checked) and so is not idiomatic. To work with data.tables, you'll have to learn some of their idioms. If you don't already know what I mean by :=, that's a good reason to check out the vignettes. Commented May 23, 2016 at 14:17
  • a) It will be much more efficient to compute the column list numeric_cols <- which(sapply(DT,is.numeric)) once at the top, instead of inside each j-expression, for each group. b) Then just reference DT[, numeric_cols] c) Yes, putting a function call inside the j-expression is tricky and often triggers syntax errors. Commented Apr 19, 2018 at 0:52
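
To make the point in the comments above concrete, here is a minimal sketch of assignment by reference with := and set(). The table and its column names are made up for illustration, and address() is used only to check that the table was not copied.

```r
library(data.table)

# Hypothetical toy table, not the OP's data
DT <- data.table(id = c(1L, NA, 3L), x = c(1.5, NA, 3), label = c("a", NA, "c"))
addr_before <- address(DT)

# := inside [ ] updates the selected rows of x in place
DT[is.na(x), x := 0]

# set() does the same kind of in-place update and is handy inside loops
set(DT, i = which(is.na(DT[["id"]])), j = "id", value = 0L)

# Same memory address as before, so DT itself was modified
same_object <- identical(address(DT), addr_before)
```

Neither call uses <-, and the character column label keeps its NA because it was never touched.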

5 Answers

12

We can use set

for(j in seq_along(DT)){
    set(DT, i = which(is.na(DT[[j]]) & is.numeric(DT[[j]])), j = j, value = 0)
}

Or create an index of the numeric columns, loop through it, and set the NA values to 0:

ind <- which(sapply(DT, is.numeric))
for(j in ind){
    set(DT, i = which(is.na(DT[[j]])), j = j, value = 0)
}

data

library(data.table)
set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))

5 Comments

What does set( ..., j = j, ...) mean? All columns? Surely we only need to do set() on the subset of columns that are numeric, as OP asked?
@smci Not all columns. In the code I got the ind which gets the column index of numeric columns, so, it is only looping through those columns
Ok. Why can't you avoid looping, by using ind to index into names(DT) to get a list of column-names and pass that as the j-argument of set()? I guess the expression to find NAs would then need to be 2D. Well I guess set() is already fairly fast.
@smci Not sure I understand your question. The j argument can take either column names or column indices. Here, 'ind' is the index.
Why can't you avoid the loop for(j in ind) { ... set(..., j=j, ...) } ? Can't you directly do set(DT, j=ind) in general? I think you could, but the only reason for the j-loop is that the i-expression to find NA rows for that specific j changes.
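
A small sketch of the two points discussed in these comments, using made-up data: j can be given as a column position or a column name, and the loop is there because the rows to update (i) differ from column to column.

```r
library(data.table)

# Hypothetical toy data: NA sits in a different row of each numeric column
make_dt <- function() data.table(a = c(NA, 1, 2), s = c("x", NA, "z"), b = c(3, NA, 4))

by_index <- make_dt()
by_name  <- make_dt()

ind <- which(sapply(by_index, is.numeric))          # positions of numeric columns
nms <- names(by_name)[sapply(by_name, is.numeric)]  # names of numeric columns

# i is recomputed on every iteration because the NA rows differ per column
for (j in ind) set(by_index, i = which(is.na(by_index[[j]])), j = j, value = 0)
for (j in nms) set(by_name,  i = which(is.na(by_name[[j]])),  j = j, value = 0)

same_result <- isTRUE(all.equal(by_index, by_name))
```

Both loops give the same table, so indexing by position or by name is a matter of taste here.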
5

I wanted to explore and possibly improve on the excellent answer given above by @akrun. Here's the data he used in his example:

library(data.table)

set.seed(24)
DT <- data.table(v1= c(NA, 1:4), v2 = c(NA, LETTERS[1:4]), v3=c(rnorm(4), NA))
DT

#>    v1   v2         v3
#> 1: NA <NA> -0.5458808
#> 2:  1    A  0.5365853
#> 3:  2    B  0.4196231
#> 4:  3    C -0.5836272
#> 5:  4    D         NA

And the two methods he suggested to use:

fun1 <- function(x){
  for(j in seq_along(x)){
    set(x, i = which(is.na(x[[j]]) & is.numeric(x[[j]])), j = j, value = 0)
  }
}

fun2 <- function(x){
  ind <-   which(sapply(x, is.numeric))
  for(j in ind){
    set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

I think the first method above is really genius as it exploits the fact that NAs are typed.
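
As far as I can tell, the combined condition in fun1 skips non-numeric columns because is.numeric() returns a single TRUE or FALSE for the whole column, and that scalar is recycled against the element-wise is.na() result. A quick sketch with made-up vectors:

```r
chr <- c(NA, "A", "B")   # character column: its NA is NA_character_
num <- c(NA, 1, 2)       # numeric column: its NA is NA_real_

# is.numeric(chr) is FALSE, so every element of the condition is FALSE
rows_chr <- which(is.na(chr) & is.numeric(chr))

# is.numeric(num) is TRUE, so only the NA position survives
rows_num <- which(is.na(num) & is.numeric(num))
```

With an empty i, set() has no rows to update, which is why character columns pass through fun1 unchanged.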

First of all, even though .SD is not available in the i argument, it is possible to pull the column by name with get(), so I thought I could sub-assign the data.table this way:

fun3 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  for(j in nms){
    x[is.na(get(j)), (j):=0]
  }
}

The generic case, of course, would be to rely on .SD and .SDcols to work only on the numeric columns:

fun4 <- function(x){
  nms <- names(x)[sapply(x, is.numeric)]
  x[, (nms):=lapply(.SD, function(i) replace(i, is.na(i), 0)), .SDcols=nms]  
}

But then I thought to myself, "Hey, who says we can't go all the way to base R for this sort of operation?" Here's a simple lapply() with a conditional statement, wrapped in setDT():

fun5 <- function(x){
setDT(
  lapply(x, function(i){
    if(is.numeric(i))
         i[is.na(i)]<-0
    i
  })
)
}
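
One thing worth noting about the base R route: unlike the set() versions, it returns a new table rather than updating the input, and assigning the double 0 promotes an integer column to double. A minimal sketch on a cut-down version of the example data (the object names res and DT_set are only for illustration):

```r
library(data.table)

DT <- data.table(v1 = c(NA, 1:4), v2 = c(NA, LETTERS[1:4]))

# Base R route, as in fun5: builds and returns a new table
res <- setDT(lapply(DT, function(i) {
  if (is.numeric(i)) i[is.na(i)] <- 0
  i
}))

# set() route, as in fun6: updates a copy in place
DT_set <- copy(DT)
set(DT_set, i = which(is.na(DT_set[["v1"]])), j = "v1", value = 0)

class(res$v1)     # "numeric": the integer column was promoted to double
class(DT_set$v1)  # "integer": set() coerces the value to the column's type
anyNA(DT$v1)      # TRUE: the original DT is untouched by the lapply() route
```

So the result of fun5 has to be assigned to something, and the benchmark above is comparing a copy-and-return approach with in-place updates.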

Finally, we could use the same idea of a conditional to limit the columns on which we apply set():

fun6 <- function(x){
  for(j in seq_along(x)){
    if (is.numeric(x[[j]]) )
      set(x, i = which(is.na(x[[j]])), j = j, value = 0)
  }
}

Here are the benchmarks:

microbenchmark::microbenchmark(
  for.set.2cond = fun1(copy(DT)),
  for.set.ind = fun2(copy(DT)),
  for.get = fun3(copy(DT)),
  for.SDcol = fun4(copy(DT)),
  for.list = fun5(copy(DT)),
  for.set.if =fun6(copy(DT))
)

#> Unit: microseconds
#>           expr     min      lq     mean   median       uq      max neval cld
#>  for.set.2cond  59.812  67.599 131.6392  75.5620 114.6690 4561.597   100 a  
#>    for.set.ind  71.492  79.985 142.2814  87.0640 130.0650 4410.476   100 a  
#>        for.get 553.522 569.979 732.6097 581.3045 789.9365 7157.202   100   c
#>      for.SDcol 376.919 391.784 527.5202 398.3310 629.9675 5935.491   100  b 
#>       for.list  69.722  81.932 137.2275  87.7720 123.6935 3906.149   100 a  
#>     for.set.if  52.380  58.397 116.1909  65.1215  72.5535 4570.445   100 a  


2

You can use the purrr function map_if() from the tidyverse, along with ifelse(), to do the job in a single line of code.

library(tidyverse)
library(data.table)  # data.table() is not part of the tidyverse
set.seed(24)
DT <- data.table(v1 = sample(c(1:3, NA), 20, replace = TRUE), v2 = sample(c(LETTERS[1:3], NA), 20, replace = TRUE), v3 = sample(c(1:3, NA), 20, replace = TRUE))

The single line below takes a DT with numeric and non-numeric columns and operates only on the numeric columns, replacing their NAs with 0:

DT %>% map_if(is.numeric,~ifelse(is.na(.x),0,.x)) %>% as.data.table

So, tidyverse can be less verbose than data.table sometimes :-)

1 Comment

May I ask why my answer was downvoted? Did it not work?
1

A simpler solution I used:

library(data.table)

your_df[, lapply(.SD, function(x){
  ifelse(is.na(x), 0, x)
}), .SDcols = is.numeric]

2 Comments

I wouldn't expect this to be fast with large datasets. For every column, you're creating two new vectors with the length of nrow(DT). Also you're not assigning this back to the desired columns so as it stands it doesn't update it but produces a new data.table (of only numeric columns).
@SamR I used the solution proposed earlier: fun6 <- function(x){ for(j in seq_along(x)){ if (is.numeric(x[[j]]) ) set(x, i = which(is.na(x[[j]])), j = j, value = 0) } }. You're right, and thanks for your solution; setnafill() is native to data.table, so it's perfect.
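
For what it's worth, the .SD approach can be made to update the original table by assigning the result back with :=. This is only a sketch on made-up data; it swaps ifelse() for data.table's nafill(), which assumes a data.table version that ships nafill() (1.12.4 or later, as far as I know).

```r
library(data.table)

# Hypothetical toy data with double and character columns
your_df <- data.table(a = c(NA, 1.5, 2), s = c("x", NA, "z"), b = c(3, NA, 4.5))

num_cols <- names(your_df)[sapply(your_df, is.numeric)]

# (num_cols) := ... writes the filled vectors back into the same columns,
# so all columns are kept and your_df is modified by reference
your_df[, (num_cols) := lapply(.SD, nafill, type = "const", fill = 0), .SDcols = num_cols]
```

The character column s stays in the table with its NA intact, which addresses both points raised in the first comment.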
1

To update NA values in numeric columns in a data.table by reference you can use setnafill(), which was introduced in 2019. This can replace NA values with either values carried forwards/backwards or a constant, e.g. 0.

Using the data from akrun's answer:

setnafill(
    DT,
    type = "const",
    fill = 0,
    cols = names(DT)[sapply(DT, is.numeric)]
)
#       v1     v2         v3
#    <int> <char>      <num>
# 1:     0   <NA> -0.5458808
# 2:     1      A  0.5365853
# 3:     2      B  0.4196231
# 4:     3      C -0.5836272
# 5:     4      D  0.0000000
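
For completeness, a minimal sketch of the other two fill types mentioned above, on a made-up single-column table:

```r
library(data.table)

DT_locf <- data.table(x = c(1, NA, NA, 4))
DT_nocb <- copy(DT_locf)

setnafill(DT_locf, type = "locf", cols = "x")  # last observation carried forward
setnafill(DT_nocb, type = "nocb", cols = "x")  # next observation carried backward

DT_locf$x  # 1 1 1 4
DT_nocb$x  # 1 4 4 4
```

Both calls update their table by reference, just like type = "const".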

Benchmark

This approach is generally faster than looping through the columns in R. Here is a benchmark against the fastest approach in the previous benchmark from dmi3kno. This runs with between 10 and 100k rows, and between 1 and 10k columns. The exception is single-column tables with up to 1,000 rows, where the loop is roughly twice as fast. With 100k rows and 10k columns, setnafill() is about 6 times faster than the other approach (total time 1.39s vs 8.27s).

for_set_if <- function(x) {
    for (j in seq_along(x)) {
        if (is.numeric(x[[j]])) {
            set(x, i = which(is.na(x[[j]])), j = j, value = 0)
        }
    }
}
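# Assumption: `randomNums` is not defined in the original post. Any numeric
# vector containing NAs with at least max(n_rows) elements would do, e.g.:
randomNums <- c(rnorm(9e4), rep(NA_real_, 1e4))
results <- bench::press(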
    n_rows = 10^(1:5),
    n_cols = 10^(0:4),
    {
        DT <- data.table(sapply(seq(n_cols), \(.) sample(randomNums, n_rows)))
        bench::mark(
            relative = TRUE,
            check = FALSE,
            for.set.if = for_set_if(copy(DT)),
            setnafill = {
                setnafill(
                    copy(DT),
                    type = "const",
                    fill = 0,
                    cols = names(DT)[sapply(DT, is.numeric)]
                )
            }
        )
    }
)
ggplot2::autoplot(results)

[Figure: benchmark results]

Full benchmark results:

# A tibble: 50 × 11
   expression n_rows n_cols   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
   <bch:expr>  <dbl>  <dbl> <dbl>  <dbl>     <dbl>     <dbl>    <dbl> <int> <dbl>   <bch:tm>
 1 for.set.if     10      1  1      1         1.64      1        1     9996     4   253.77ms
 2 setnafill      10      1  2.06   1.95      1         3.00     1.07  9993     7   416.34ms
 3 for.set.if    100      1  1      1         1.98      1        1     9996     4   208.58ms
 4 setnafill     100      1  2.00   1.99      1         2.76     1.02  8676     7    357.9ms
 5 for.set.if   1000      1  1      1         1.84      1        1     9996     4   221.85ms
 6 setnafill    1000      1  1.81   1.81      1         1.75     1.09  9992     8   408.56ms
 7 for.set.if  10000      1  1.41   1.39      1         1.75     1     7215     3   483.95ms
 8 setnafill   10000      1  1      1         1.28      1        2.55  8459     7    443.1ms
 9 for.set.if 100000      1  6.98   6.10      1         3.15     1      694     2   491.34ms
10 setnafill  100000      1  1      1         5.72      1        2.03  3906     4   483.69ms
11 for.set.if     10     10  1.81   1.72      1         1        1.95  3737    10   471.84ms
12 setnafill      10     10  1      1         1.76     11.9      1     6412     5   459.47ms
13 for.set.if    100     10  1.86   1.89      1         1        1.62  3379     8    468.8ms
14 setnafill     100     10  1      1         1.95      6.16     1     6669     5   474.95ms
15 for.set.if   1000     10  2.38   2.24      1         1        1.54  2416     6   459.42ms
16 setnafill    1000     10  1      1         2.19      1.50     1     5436     4   471.23ms
17 for.set.if  10000     10  6.00   5.10      1         2.37     1      589     2    489.3ms
18 setnafill   10000     10  1      1         5.27      1        2.04  3048     4   480.36ms
19 for.set.if 100000     10  8.67   8.19      1         3.34     1       66     2   482.05ms
20 setnafill  100000     10  1      1         8.27      1        3.13   523     6   461.94ms
21 for.set.if     10    100  3.63   3.27      1         1        1.64   362     8    470.2ms
22 setnafill      10    100  1      1         3.72     85.6      1     1381     5   482.35ms
23 for.set.if    100    100  3.17   3.21      1         1        2.09   336     8   453.69ms
24 setnafill     100    100  1      1         3.14      9.65     1     1101     4   474.13ms
25 for.set.if   1000    100  4.84   4.22      1         1        1.01   229     5   467.49ms
26 setnafill    1000    100  1      1         4.66      1.46     1     1080     5   472.96ms
27 for.set.if  10000    100  7.63   7.15      1         2.46     1       58     2   478.28ms
28 setnafill   10000    100  1      1         7.24      1        3.12   337     5   383.72ms
29 for.set.if 100000    100  8.58   8.42      1         3.36     1        3     5   208.05ms
30 setnafill  100000    100  1      1         8.26      1        1.05    47    10   394.43ms
31 for.set.if     10   1000  3.33   3.17      1         1        1.95    40    11   370.34ms
32 setnafill      10   1000  1      1         3.08    342.       1      153     7   459.56ms
33 for.set.if    100   1000  4.28   3.35      1         1        1.77    36    10   363.98ms
34 setnafill     100   1000  1      1         3.23     10.3      1      144     7   450.73ms
35 for.set.if   1000   1000  5.30   4.33      1         1        1.23    22     7   360.74ms
36 setnafill    1000   1000  1      1         4.34      1.45     1      118     7   445.31ms
37 for.set.if  10000   1000  7.61   7.39      1         2.47     1        3     3   231.53ms
38 setnafill   10000   1000  1      1         7.41      1        2.06    36    10   375.15ms
39 for.set.if 100000   1000 11.2    6.65      1         3.36     1        1     4    889.3ms
40 setnafill  100000   1000  1      1         6.87      1        1.29     4     3      518ms
41 for.set.if     10  10000  4.48   3.29      1         1        7.18     6    14   585.51ms
42 setnafill      10  10000  1      1         2.11    493.       1       13     2   600.16ms
43 for.set.if    100  10000  3.88   3.50      1         1        4.68     5     5   534.34ms
44 setnafill     100  10000  1      1         3.20     10.4      1       15     1   500.31ms
45 for.set.if   1000  10000  5.35   4.82      1         1        1.41     3     3   543.75ms
46 setnafill    1000  10000  1      1         4.61      1.45     1       13     2   511.45ms
47 for.set.if  10000  10000 12.3   11.4       1         2.48     1        1     3      1.14s
48 setnafill   10000  10000  1      1         7.90      1        3.29     4     5   578.34ms
49 for.set.if 100000  10000  5.94   5.94      1         3.36     1        1     3      8.27s
50 setnafill  100000  10000  1      1         5.94      1        1.98     1     1      1.39s
