Time difference calculated from wide data with missing rows

Question

I have a longitudinal data set in wide format, from which I want to compute the time (in years and days) between the first observation date and the last date each individual was observed. Dates are in the format yyyy-mm-dd. The data set has four observation periods, some with missing dates. An example is as follows

df1<-data.frame("id"=c(1:4),
           "adate"=c("2011-06-18","2011-06-18","2011-04-09","2011-05-20"),
           "bdate"=c("2012-06-15","2012-06-15",NA,"2012-05-23"),
           "cdate"=c("2013-06-18","2013-06-18","2013-04-09",NA),
           "ddate"=c("2014-06-15",NA,"2014-04-11",NA))

Here "adate" is the first date, and the last date is the date an individual was last seen. To compute the time difference (lastdate - adate), I have tried using the "lubridate" package, for example

lubridate::time_length(difftime(as.Date("2012-05-23"), as.Date("2011-05-20")),"years")

However, the last date does not always come from the same column, so I'm looking for a way to automate the calculation in R. The expected output would look like

  id years days
1  1  2.99 1093
2  2  2.00  731
3  3  3.01 1098
4  4  1.01  369

Years are rounded to 2 decimal places.

Create a lastdate column: with(df1, pmax(bdate, cdate, ddate)), then calculate the difference between lastdate and adate.
– Bill O'Brien, Commented Jul 31, 2020 at 22:10
Possible duplicate: stackoverflow.com/questions/39934159/… or stackoverflow.com/questions/33664659/…
– MrFlick, Commented Jul 31, 2020 at 22:10
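
The pmax() suggestion in the first comment can be sketched roughly as follows. This is only an illustration, not the commenter's exact code: it assumes the columns are converted to class Date first, and it adds na.rm = TRUE, because pmax() with its default na.rm = FALSE returns NA for any row that contains a missing date. The names d and res_pmax are made up for this sketch, and years are taken as days/365.

```r
df1 <- data.frame(id = 1:4,
                  adate = c("2011-06-18", "2011-06-18", "2011-04-09", "2011-05-20"),
                  bdate = c("2012-06-15", "2012-06-15", NA, "2012-05-23"),
                  cdate = c("2013-06-18", "2013-06-18", "2013-04-09", NA),
                  ddate = c("2014-06-15", NA, "2014-04-11", NA))

# Work on a copy (hypothetical name d) with the date columns as class Date
d <- df1
d[-1] <- lapply(d[-1], as.Date)

# Row-wise latest date; na.rm = TRUE skips the missing entries
d$lastdate <- pmax(d$adate, d$bdate, d$cdate, d$ddate, na.rm = TRUE)

# Difference in days, and years under the assumption of a 365-day year
d$days  <- as.numeric(d$lastdate - d$adate)
d$years <- round(d$days / 365, 2)

res_pmax <- d[c("id", "years", "days")]
res_pmax
```

Because pmax() takes the latest date in each row, this does not depend on which column holds the last observation.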

M-- · Accepted Answer · 2020-07-31 22:40:15Z

2

Another tidyverse solution is to convert the data to long format, remove the NA dates, and take the time difference between the last and first date for each id. This relies on the date columns being in chronological order.

library(dplyr)
library(tidyr)
library(lubridate)

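df1 %>%
  pivot_longer(-id) %>%                    # one row per id and date column
  na.omit %>%                              # drop the missing dates
  group_by(id) %>%
  mutate(value = as.Date(value)) %>%
  # last(value) - first(value) is the span from adate to the last observed date
  summarise(years = time_length(difftime(last(value), first(value)),"years"),
            days = as.numeric(difftime(last(value), first(value))))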

#> # A tibble: 4 x 3
#>      id years  days
#>   <int> <dbl> <dbl>
#> 1     1  2.99  1093
#> 2     2  2.00   731
#> 3     3  3.01  1098
#> 4     4  1.01   369
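
For comparison, the same long-format idea can be sketched in base R with min() and max() instead of first() and last(), so the result does not depend on the columns being in chronological order. This is a rough sketch under stated assumptions rather than part of the answer above: the names long and res_long are invented here, and years are taken as days/365.

```r
df1 <- data.frame(id = 1:4,
                  adate = c("2011-06-18", "2011-06-18", "2011-04-09", "2011-05-20"),
                  bdate = c("2012-06-15", "2012-06-15", NA, "2012-05-23"),
                  cdate = c("2013-06-18", "2013-06-18", "2013-04-09", NA),
                  ddate = c("2014-06-15", NA, "2014-04-11", NA))

# Stack the date columns into one long vector; unlist() goes column by column,
# so the ids are repeated once per date column
long <- data.frame(id = rep(df1$id, times = ncol(df1) - 1),
                   value = as.Date(unlist(df1[-1], use.names = FALSE)))
long <- long[!is.na(long$value), ]   # drop the missing observations

# Days from the earliest to the latest date per id
days <- tapply(long$value, long$id, function(v) as.numeric(max(v) - min(v)))

res_long <- data.frame(id = as.integer(names(days)),
                       years = round(as.vector(days) / 365, 2),
                       days = as.vector(days))
res_long
```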

akrun · Accepted Answer · 2020-07-31 22:10:48Z

We could use pmap from purrr to work through the data row by row:

library(dplyr)
library(purrr)
library(tidyr)
df1 %>%
  mutate(out = pmap(.[-1], ~ {
    # non-missing dates of the current row, in column order
    dates <- as.Date(na.omit(c(...)))
    span <- difftime(last(dates), first(dates))
    tibble(years = lubridate::time_length(span, "years"),
           days = lubridate::time_length(span, "days"))
  })) %>%
  unnest_wider(out)
# A tibble: 4 x 7
#     id adate      bdate      cdate      ddate      years  days
#  <int> <chr>      <chr>      <chr>      <chr>      <dbl> <dbl>
#1     1 2011-06-18 2012-06-15 2013-06-18 2014-06-15  2.99  1093
#2     2 2011-06-18 2012-06-15 2013-06-18 <NA>        2.00   731
#3     3 2011-04-09 <NA>       2013-04-09 2014-04-11  3.01  1098
#4     4 2011-05-20 2012-05-23 <NA>       <NA>        1.01   369

Onyambu · Accepted Answer · 2020-07-31 22:56:23Z

1

Most of the functions introduced in the other answers may look quite complex. You should try to learn them if possible. In the meantime, here is a base R approach:

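df <- df1 # assumes df1 from the question
df[-1] <- lapply(df[-1], as.Date) # Convert the date columns to class Date first

grp <- droplevels(interaction(df[,1],row(df[-1]))) # Create a grouping:

days <- tapply(unlist(df[-1]),grp, function(x)max(x,na.rm = TRUE) - x[1]) #Get the difference

cbind(df[1],days, years = round(days/365,2)) # Create your table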

    id days years
1.1  1 1093  2.99
2.2  2  731  2.00
3.3  3 1098  3.01
4.4  4  369  1.01

If you are comfortable with higher-level functions, you could also do the following (the date columns of df1 must already be of class Date):

dat <- aggregate(adate~id,reshape(df1,list(2:ncol(df1)), dir="long"),function(x)max(x) - x[1])
transform(dat,year = round(adate/365,2))
  id adate  year
1  1 1093  2.99 
2  2  731  2.00 
3  3 1098  3.01 
4  4  369  1.01
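
The base R snippets here convert days to years by dividing by 365. Which year length to use is a convention. As a quick check, assuming 365.25 (an average year including leap days) as the alternative, both choices round to the same two decimals for this example. The variable names below are made up for illustration.

```r
days <- c(1093, 731, 1098, 369)   # day counts from the outputs above

years_365   <- round(days / 365, 2)      # fixed 365-day year
years_36525 <- round(days / 365.25, 2)   # average year length including leap days

# Side-by-side comparison of the two conventions
data.frame(days, years_365, years_36525)
```

For longer time spans the two conventions can differ in the second decimal, so it is worth stating which one is used.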

edited Jul 31, 2020 at 22:56

answered Jul 31, 2020 at 22:50

Ronak Shah · Accepted Answer · 2020-08-01 01:13:29Z

1

Using base R apply:

df1[-1] <- lapply(df1[-1], as.Date)

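df1[c('years', 'days')] <- t(apply(df1[-1], 1, function(x) {
      x <- na.omit(x)  # apply() passes each row as a character vector
      # 'days' must be passed as units; the third positional argument of difftime() is tz
      x1 <- as.numeric(difftime(as.Date(x[length(x)]), as.Date(x[1]), units = 'days'))
      c(x1/365, x1)
}))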

df1[c('id', 'years', 'days')]
#  id    years days
#1  1 2.994521 1093
#2  2 2.002740  731
#3  3 3.008219 1098
#4  4 1.010959  369
