2

I'm trying to do a rather complicated filter using if else and I keep getting turned around in circles.

Here is a sample of the data:

individual_id born        mom_id trap_date 
        13212 2013-03-16  11926  2014-06-26
        13213 2013-03-16  11926  2013-07-23
        13214 2013-03-16  11926  2015-06-01
        13212 2013-03-16  11926  2014-06-26
        11926 2010-04-08  10422  2013-01-23
        11926 2010-04-08  10422  2013-02-01
        11926 2010-04-08  10422  2013-03-08
        11926 2010-04-08  10422  2013-03-12
        11926 2010-04-08  10422  2013-04-23
        11926 2010-04-08  10422  2013-05-01
        11926 2010-04-08  10422  2013-08-23
        11926 2010-04-08  10422  2013-09-01
        11926 2010-04-08  10422  2013-10-23
        11926 2010-04-08  10422  2013-12-01

I have a column called mom_id and these individuals also show up in the individual_id column.

I'd like to remove any trap_date records that occur - 35 days or + 70 days from the born date for a mom_id, only if she also shows up in the individual_id column.

So, in this example, I'd like to remove all trap_date records/rows that are between February 9, 2013 (2013-02-09) and May 25, 2013 (2013-05-25) because the born date for mom_id 11926 is March 16, 2013 (2013-03-16).

Desired result:

individual_id born        mom_id trap_date 
        13212 2013-03-16  11926  2014-06-26
        13213 2013-03-16  11926  2013-07-23
        13214 2013-03-16  11926  2015-06-01
        13212 2013-03-16  11926  2014-06-26
        11926 2010-04-08  10422  2013-01-23
        11926 2010-04-08  10422  2013-02-01
        11926 2010-04-08  10422  2013-08-23
        11926 2010-04-08  10422  2013-09-01
        11926 2010-04-08  10422  2013-10-23
        11926 2010-04-08  10422  2013-12-01

A dplyr solution would be appreciated, but I'd appreciate any help at this point!

3
  • Perhaps, I am misled but I believe born date for mom_id 11926 is 2010-04-08 not 2013-03-16 which is the born date for mom_id 19262, isn't it? The ids look confusingly similar. Commented Jul 15, 2021 at 17:59
  • That's a fair question! Because individuals show up in both the mom_id and individual_id column, it is confusing to figure out what born corresponds to. But, born is the date mom_id has to be filtered from (so mom_id==11926 has a born date of 2013-03-16 and mom_id==10422 has a born date of 2010-04-08). Commented Jul 15, 2021 at 18:07
  • Thanks for clarification! Commented Jul 15, 2021 at 18:20

3 Answers 3

2

I think this does what you're looking for.

mutate(df, in_individual = mom_id %in% individual_id) %>%
    filter((!in_individual) | ((born - 35 <= trap_date) & (trap_date <= born + 70)))
Sign up to request clarification or add additional context in comments.

2 Comments

Is there a typo in your solution? I get this error: Error: unexpected '<=' in: " in_individual = mom_id %in% individual_id) %>% filter((!in_id) || born - 35 <= trap_date <="
Sorry, I forgot R doesn't support chaining comparison operators. I've edited my answer.
1

If I understand correctly, this can be solved using a non-equi anti-join:

library(data.table)
df2 <- unique(setDT(df1)[, .(mom_id, lo = born - 35, hi = born + 70)])
df1[!df1[df2, on = c("individual_id == mom_id", "trap_date >= lo", "trap_date <= hi"),
    nomatch = NULL, which = TRUE]]
    individual_id       born mom_id  trap_date
 1:         19262 2013-03-16  11926 2014-06-26
 2:         19263 2013-03-16  11926 2013-07-23
 3:         19264 2013-03-16  11926 2015-06-01
 4:         19262 2013-03-16  11926 2014-06-26
 5:         11926 2010-04-08  10422 2013-01-23
 6:         11926 2010-04-08  10422 2013-02-01
 7:         11926 2010-04-08  10422 2013-08-23
 8:         11926 2010-04-08  10422 2013-09-01
 9:         11926 2010-04-08  10422 2013-10-23
10:         11926 2010-04-08  10422 2013-12-01

Explanation

  • For each unique mom_id the range lo to hiof dates to be removed is computed. So, df2 becomes

       mom_id         lo         hi
    1:  11926 2013-02-09 2013-05-25
    2:  10422 2010-03-04 2010-06-17
    
  • Now, the row numbers of the rows to be removed are identified by a non-equi join

     df1[df2, on = c("individual_id == mom_id", "trap_date >= lo", "trap_date <= hi"),
         nomatch = NULL, which = TRUE]  
    

    which returns

    [1] 7 8 9 10

  • Finally, these rows are removed from df1 by subsetting (using !)

Data

library(data.table)
df1 <- fread("individual_id born        mom_id trap_date 
        19262 2013-03-16  11926  2014-06-26
        19263 2013-03-16  11926  2013-07-23
        19264 2013-03-16  11926  2015-06-01
        19262 2013-03-16  11926  2014-06-26
        11926 2010-04-08  10422  2013-01-23
        11926 2010-04-08  10422  2013-02-01
        11926 2010-04-08  10422  2013-03-08
        11926 2010-04-08  10422  2013-03-12
        11926 2010-04-08  10422  2013-04-23
        11926 2010-04-08  10422  2013-05-01
        11926 2010-04-08  10422  2013-08-23
        11926 2010-04-08  10422  2013-09-01
        11926 2010-04-08  10422  2013-10-23
        11926 2010-04-08  10422  2013-12-01")

2 Comments

Are there any other packages I should have installed? R is telling me it could not find function "."
@BlunderingEcologist, I was able to reproduce the issue. It can be fixed by coercing df1 to class data.table. See my update. I am sorry for the inconvenience.
0

First we convert both born and trap_date to dates, in case they're only text. Then we check the mom_id. Finally, we check the trap_date.

df %>%
  mutate(
    across(c(born, trap_date), lubridate::ymd), # applies ymd parser over both columns
    in_individual = mom_id %in% individual_id,
    within_interval = 
      trap_date %within% 
        interval(
          start = born - 35,
          end = .born + 70)) %>%
    filter(in_individual, within_interval) # filter out conditional on both

Edit

df %>%
  mutate(
    across(c(born, trap_date), lubridate::ymd), # applies ymd parser over both columns
    in_individual = mom_id %in% individual_id,
    within_interval = purrr::map2_lgl(
      .x = trap_date,
      .y = born,
      .f = ~ .x %within%
        interval(
          start = .y - 35,
          end = .y + 70)) %>%
    filter(in_individual, within_interval) # filter out conditional on both

2 Comments

I get an error for this the within-interval part of your solution: Error: Problem with mutate() input within_interval.
I've editted my answer. Mapping over both inputs should work just fine.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.