Replace missing values using other rows only when other columns are the same in R

Question

I suspect this has been asked before, but I couldn't find what I'm looking for.

I want to replace NA values with the value from an adjacent row (in the example below, the row underneath), but only when all other values in the two rows are the same. Bonus points for a data.table solution.

Right now, I've managed to do it only with a (very inefficient) loop.

In addition, my current code does not replace an NA when there are two NAs in the same row.

I have a strong feeling that I'm overthinking this problem. Any ideas for making this simpler?

library(data.table)

ex <- data.table(
    id = c(1, 1, 2, 2),
    attr1 = c(NA, NA, 3, 3),
    attr2 = c(2, 2, NA, 3),
    attr3 = c(NA, 2, 2, 1),
    attr4 = c(1, 1, 1, 3)
)

desired_ex <- data.table(
    id = c(1, 1, 2, 2),
    attr1 = c(NA, NA, 3, 3),
    attr2 = c(2, 2, NA, 3),
    attr3 = c(2, 2, 2, 1),
    attr4 = c(1, 1, 1, 3)
)

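col_names <- paste0("attr", 1:4)
# Compare each row with the row below it; the last row has no successor
for (r in seq_len(nrow(ex) - 1L)) {
    # Columns that are NA in row r
    to_check <- col_names[colSums(is.na(ex[r, .SD, .SDcols = col_names])) > 0]
    if (length(to_check) == 0) {
        next
    }

    for (col_check in to_check) {
        # Rows r and r + 1 without the column currently being checked
        .ex <- copy(ex)[seq(from = r, to = r + 1), ]
        .ex[[col_check]] <- NULL
        # If the remaining columns are identical, take the value from the row below
        if (nrow(unique(.ex)) == 1) {
            set(ex, i = r, j = col_check, value = ex[[col_check]][r + 1L])
        }
    }
}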

all.equal(ex, desired_ex)

I am not really sure what you want to do. Can you please explain in more detail? For example, why does attr2 still have an NA in desired_ex while the NA in attr3 is replaced?
– bird, Commented Jul 14, 2021 at 8:45
Look at rows 1:2: apart from the NA in attr3, they are the same, so I would like to replace the NA with the value from the other row. This is not the case for rows 3:4, which I see as different rows because, apart from the NA in attr2, they also differ in attr3 and attr4. Does that make more sense now?
– AverageGuy, Commented Jul 14, 2021 at 9:16

Uwe · Accepted Answer · 2021-07-27 09:53:33Z

Here is a solution that works for an arbitrary number of rows and columns within each id, not just for pairs of rows (the \(x) shorthand for function(x) requires R 4.1.0 or later):

library(data.table)
ex[,  
   if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) {
     lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) 
   } else {
     .SD
   }, by  = id]

or, more compact,

ex[, if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) 
  lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by  = id]

   id attr1 attr2 attr3 attr4
1:  1    NA     2     2     1
2:  1    NA     2     2     1
3:  2     3    NA     2     1
4:  2     3     3     1     3

Explanation

For each id, the rows are checked against the condition. If they do not fulfill it, .SD is returned unchanged. If they do, a new .SD is created by picking the first non-NA value in each column (or NA if all values are NA) using fcoalesce() and replicating that value as many times as there are rows in .SD.
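
The coalesce-and-replicate step can be tried on a single vector outside the grouped call. This is only a minimal sketch: the names x and first_value are made up for illustration, and length(x) stands in for .N.

```r
library(data.table)

# One column of one group (hypothetical values)
x <- c(NA, 2, NA)

# as.list(x) splits the vector into length-1 pieces, so fcoalesce()
# returns the first non-NA element (or NA if every element is NA)
first_value <- fcoalesce(as.list(x))

# Inside the grouped call, .N plays the role of length(x)
filled <- rep(first_value, length(x))
filled
```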

The check consists of two parts. First, for each column in .SD, it is checked whether all values are identical, ignoring any NA. Then, it is checked whether this is TRUE for all columns.
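
One edge case of the per-column check may be worth knowing: when the first value of a column is NA, first(x) == x is NA everywhere, na.rm = TRUE drops all of it, and all() returns TRUE even if the remaining values differ. The sketch below isolates the check and compares it with a possible stricter alternative based on uniqueN(). The helper names same_as_first and one_distinct, and the small table dt, are hypothetical and not part of the answer.

```r
library(data.table)

# The check as written in the answer: compare every value with the first one
same_as_first <- function(x) all(first(x) == x, na.rm = TRUE)

# Possible alternative (an assumption, not the answer's code):
# at most one distinct non-NA value in the column
one_distinct <- function(x) uniqueN(x, na.rm = TRUE) <= 1L

x_ok  <- c(2, NA, 2)  # consistent values
x_bad <- c(NA, 2, 3)  # inconsistent values, but the first one is NA

same_as_first(x_ok)
same_as_first(x_bad)  # passes although 2 and 3 differ
one_distinct(x_ok)
one_distinct(x_bad)

# The grouped call with the stricter check leaves such a group untouched
dt <- data.table(id = c(1, 1, 1), a = c(NA, 2, 3), b = c(1, 1, 1))
res <- dt[, if (all(vapply(.SD, one_distinct, logical(1L))))
  lapply(.SD, function(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by = id]
res
```

Neither example table in this answer contains a group whose first value is NA while later values differ, so the outputs shown here are the same under both checks.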

Note that .SD is a data.table containing the Subset of Data for each group, excluding any columns used in by.

Another use case with more rows and columns

ex2 <- fread("
 id   foo   bar   baz attr4 attr5
  1    NA     2    NA     1     5
  1    NA     2     2     1    NA
  1    NA     2    NA    NA    NA
  2     3    NA     2     1     2
  2     3     3     1     3     2
  2     3     3     1     4     2
  3     5     2    NA     1     3
  3    NA     2     2     1     3
  4    NA    NA    NA    NA    NA
")

ex2[, if (sum(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))) == ncol(.SD)) 
  lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by  = id]

   id foo bar baz attr4 attr5
1:  1  NA   2   2     1     5
2:  1  NA   2   2     1     5
3:  1  NA   2   2     1     5
4:  2   3  NA   2     1     2
5:  2   3   3   1     3     2
6:  2   3   3   1     4     2
7:  3   5   2   2     1     3
8:  3   5   2   2     1     3
9:  4  NA  NA  NA    NA    NA

chinsoon12 · Accepted Answer · 2021-07-14 10:13:51Z

Here is an option mixing base R with data.table:

#lead the values for comparison
cols <- paste0("attr", 1L:4L)
lcols <- paste0("lead_", cols)
ex[, (lcols) := shift(.SD, -1L), id]

#check which rows fulfill the criteria
flags <- apply(ex[, ..cols] == ex[, ..lcols], 1L, all, na.rm=TRUE) & 
    apply(ex[, ..lcols], 1L, function(x) !all(is.na(x)))

#update those rows with values from row below
ex[(flags), (cols) := 
    mapply(function(x, y) fcoalesce(x, y), mget(lcols), mget(cols), SIMPLIFY=FALSE)]
ex[, (lcols) := NULL][]

This solution assumes that no recursive filling is needed, i.e. that the row after next is never used to fill the current row when the criterion is met.
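
To illustrate that assumption, here is a small sketch that runs the same steps on a made-up three-row group. The table dt3 and its values are hypothetical, and .SDcols = cols is added so that only the attribute columns are shifted. Because the lead values are computed once before any filling, the NA in row 1 only sees the (still NA) value of row 2 and is not filled from row 3.

```r
library(data.table)

# Hypothetical group: the only non-NA value of attr1 is two rows below row 1
dt3 <- data.table(id = c(1, 1, 1), attr1 = c(NA, NA, 5), attr2 = c(1, 1, 1))
cols  <- c("attr1", "attr2")
lcols <- paste0("lead_", cols)

# Lead the values for comparison
dt3[, (lcols) := shift(.SD, -1L), by = id, .SDcols = cols]

# Rows that match the row below (ignoring NA) and have a row below at all
flags <- apply(dt3[, ..cols] == dt3[, ..lcols], 1L, all, na.rm = TRUE) &
    apply(dt3[, ..lcols], 1L, function(x) !all(is.na(x)))

# Fill from the row below, then drop the helper columns
dt3[(flags), (cols) :=
    mapply(function(x, y) fcoalesce(x, y), mget(lcols), mget(cols), SIMPLIFY = FALSE)]
dt3[, (lcols) := NULL][]
```

Here row 2 picks up the 5 from row 3, while row 1 keeps its NA in attr1. If all rows of a group should end up with the same value, a group-wise approach is the better fit.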
