Using dplyr to check values of multiple rows that meet a condition (ex- all rows where the date column falls in a specified period)

Question

I have a dataset of event ids, the event type, and the time of the event. The events consist of "start" and "pause". I would like to identify "pause" events that are not followed by a "start" event within 7 days and classify these as "stops".

Here is the code for the test dataset:

test <- data.frame("id" = 1:5,
               "event" = c("pause",
                           "pause",
                           "start",
                           "pause",
                           "start"),
               "time" = dmy("03-11-2012",
                            "05-11-2012",
                            "06-11-2012",
                            "21-11-2012",
                            "30-11-2012"))

So far, I used lead() to check if the following event was a "start" event AND happened within 7 days. However, I realized that sometimes a "pause" event was followed by another "pause" event and then a "start" event, all within 7 days. Both "pause" events in this case should not be considered to be a stop. This means that I need to check all events/rows that occurred within 7 days of the "pause" event and look for a "start" event.

I am looking for a function I can use within dplyr (I'll use non-dplyr solutions if I have to) where I can check the value of multiple rows.

My solution so far using lead(), which checks the immediate next row only.

test2 <- test %>%
mutate(stop = ifelse(event == "pause" &
                     !((time + days(7) > lead(time)) & 
                          lead(event) == "start"),
                   "yes",
                   "no"))

This gives

|id|event|time      |stop|
|------------------------|
|1 |pause|2012-11-03|yes |
|2 |pause|2012-11-05|no  |
|3 |start|2012-11-06|no  |
|4 |pause|2012-11-21|yes |
|5 |start|2012-11-30|no  |

I would like the stop column value for the first "pause" to also be a "no" because it has a "start" event within 7 days of it.

Allan Cameron · Accepted Answer · 2020-08-25 16:54:24Z

2

If you want to do this inside a dplyr function, you can sapply inside a mutate:

test %>% 
  mutate(stop = sapply(seq_along(time),
          function(i) {
              if(event[i] != "pause") return(FALSE)
              ind <- which(time > time[i] & event == "start")
              if(length(ind) == 0) return(FALSE)
              as.numeric(difftime(time[ind[1]], time[i], units = "day")) > 7
            }))

#>   id event       time  stop
#> 1  1 pause 2012-11-03 FALSE
#> 2  2 pause 2012-11-05 FALSE
#> 3  3 start 2012-11-06 FALSE
#> 4  4 pause 2012-11-21  TRUE
#> 5  5 start 2012-11-30 FALSE

answered Aug 25, 2020 at 16:54

Allan Cameron

178k7 gold badges70 silver badges118 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

spark Over a year ago

Did not know I could use apply functions within mutate! The answer from @dan also works, I tested both with microbenchmark and this solution seems to be more efficient.

spark Over a year ago

actually shouldn't be if(length(ind) == 0) return(TRUE) ? Both work with the test data, but in the case where a "pause" was followed by more "pause" events only, then the initial "pause" event should be considered a "stop". Sorry if I didn't make that clear in my question

Dan Chaltiel · Accepted Answer · 2020-08-25 17:04:00Z

Although it might get slow with large dataset, this might do the work:

library(dplyr)
library(purrr)
test %>% 
  mutate(
    stop = ifelse(event=="pause" & !((time + days(7) > lead(time)) & lead(event) == "start"),
                  "yes", "no"),
    stop2 = ifelse(map_lgl(row_number(), 
                           ~any(event=="start" & time>=time[.x] & time<=time[.x] + days(7))),
                   "no", "yes")
  )

#   id event       time stop stop2
# 1  1 pause 2012-11-03  yes    no
# 2  2 pause 2012-11-05   no    no
# 3  3 start 2012-11-06   no    no
# 4  4 pause 2012-11-21  yes   yes
# 5  5 start 2012-11-30   no    no

Using row_number() and time[.x], this allows to consider every row independently. Then, we just check if there is any "start" between "now" and "in 7 days" and put the right value accordingly. purrr::map_lgl allows to loop over every row and return a logical vector.

The slowness comes from the fact that you have to check for all the rows each time you want to compute the value for one row.

Collectives™ on Stack Overflow

Using dplyr to check values of multiple rows that meet a condition (ex- all rows where the date column falls in a specified period)

2 Answers 2

2 Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related