0

I guess that other people have already looked for it but couldn't find what I'm looking for.

I want to replace NA values with the value of the row above, only when all other values are the same. Bonus point for data.table solution.

Right now, I've managed to do it only with a (very inefficient) loop.

In addition, my current code does not replace NA in case that there are two NA's in the same row.

I have a strong feeling that I'm overthinking this problem. Any ideas of making this stuff easier?

ex <- data.table(
    id = c(1, 1, 2, 2),
    attr1 = c(NA, NA, 3, 3),
    attr2 = c(2, 2, NA, 3),
    attr3 = c(NA, 2, 2, 1),
    attr4 = c(1, 1, 1, 3)
)

desired_ex <- data.table(
    id = c(1, 1, 2, 2),
    attr1 = c(NA, NA, 3, 3),
    attr2 = c(2, 2, NA, 3),
    attr3 = c(2, 2, 2, 1),
    attr4 = c(1, 1, 1, 3)
)

col_names <- paste0("attr", 1:4)
r<-1
for (r in 1:nrow(ex)) {
    print(r)
    to_check <- col_names[colSums(is.na(ex[r, .SD, .SDcols = col_names])) >0]
    if (length(to_check) == 0) {
        print("no NA- next")
        next
    }
    
    for (col_check in to_check) {
        .ex <- copy(ex)[seq(from = r, to = r + 1), ]
        .ex[[col_check]] <- NULL
        if (nrow(unique(.ex)) == 1) {
            ex[[col_check]][r] <- ex[[col_check]][r + 1]
        }
    }
}

all.equal(ex, desired_ex)
2
  • I am not really sure what you want to do. Can you please explain in more detail? For example, why in desired_ex attr2 has an NA but it is replaced in the attr3? Commented Jul 14, 2021 at 8:45
  • Look at rows 1:2, apart from the NA in attr3, they are the same. Thus I would like to replace the NA with the value in the other line. However, this is not the case for rows 3:4, I see them as different rows, as apart from the NA in attr2, they differ in attr3 and attr4. Does it make more sense now? Commented Jul 14, 2021 at 9:16

2 Answers 2

1

Here is a solution which will work for an arbitrary number of rows and columns within each id not just pairs of rows:

library(data.table)
ex[,  
   if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) {
     lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) 
   } else {
     .SD
   }, by  = id]

or, more compact,

ex[, if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) 
  lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by  = id]
   id attr1 attr2 attr3 attr4
1:  1    NA     2     2     1
2:  1    NA     2     2     1
3:  2     3    NA     2     1
4:  2     3     3     1     3

Explanation

For each id it is checked if the rows fulfill the condition. If not .SD is returned unchanged. If the condition is fulfilled a new .SD is created by picking the first non-NA value in each column (or NA in case of all NA) using fcoalesce() and replicating this value as many times as there are rows in .SD.

The check for the condition consists of 2 parts. First, it is checked for each column in .SD if all values are identical thereby ignoring any NA. Finally, it is checked if this is TRUE for all columns.

Note that .SD is a data.table containing the Subset of Data for each group, excluding any columns used in by.

Another use case with more rows and columns

ex2 <- fread("
 id   foo   bar   baz attr4 attr5
  1    NA     2    NA     1     5
  1    NA     2     2     1    NA
  1    NA     2    NA    NA    NA
  2     3    NA     2     1     2
  2     3     3     1     3     2
  2     3     3     1     4     2
  3     5     2    NA     1     3
  3    NA     2     2     1     3
  4    NA    NA    NA    NA    NA
")

ex2[, if (sum(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))) == ncol(.SD)) 
  lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by  = id]
   id foo bar baz attr4 attr5
1:  1  NA   2   2     1     5
2:  1  NA   2   2     1     5
3:  1  NA   2   2     1     5
4:  2   3  NA   2     1     2
5:  2   3   3   1     3     2
6:  2   3   3   1     4     2
7:  3   5   2   2     1     3
8:  3   5   2   2     1     3
9:  4  NA  NA  NA    NA    NA
Sign up to request clarification or add additional context in comments.

Comments

0

Here is an option mixing base R with data.table:

#lead the values for comparison
cols <- paste0("attr", 1L:4L)
lcols <- paste0("lead_", cols)
ex[, (lcols) := shift(.SD, -1L), id]

#check which rows fulfill the criteria
flags <- apply(ex[, ..cols] == ex[, ..lcols], 1L, all, na.rm=TRUE) & 
    apply(ex[, ..lcols], 1L, function(x) !all(is.na(x)))

#update those rows with values from row below
ex[(flags), (cols) := 
    mapply(function(x, y) fcoalesce(x, y), mget(lcols), mget(cols), SIMPLIFY=FALSE)]
ex[, (lcols) := NULL][]

Solution assumes that there is no recursive populating where the row after next is used to fill the current row if criteria is met.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.