I have a dataset where each row is identified by a hospitalization id. Each row contains information on the hospitalization id, the dates of admission and discharge, as well as the identification of the hospital where it took place and the physician responsible for it.
I would like to know, for each hospitalization, the id of all other hospitalizations concluded in the 30 days before the beginning of the given hospitalization that were performed by other physicians in the same hospital.
Below is a simple example of 8 hospitalizations performed by 2 physicians in 2 hospitals (physicians may work in more than one hospital).
library("tidyverse")
df <- data.frame(hospitalization_id = c(1, 2, 3,
4, 5,
6, 7, 8),
hospital_id = c("A", "A", "A",
"A", "A",
"B", "B", "B"),
physician_id = c(1, 1, 1,
2, 2,
2, 2, 2),
date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
"2000-01-12", "2000-01-20",
"2000-02-10", "2000-02-11", "2000-02-12")),
date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
"2000-01-18", "2000-01-22",
"2000-02-11", "2000-02-14", "2000-02-17")))
Using the solution posted in R (dplyr): find all rows in row-specific range: first, I find all other hospitalizations during the 30-day period prior to the start of the given hospitalization in the given hospital; then, I drop those hospitalizations that were performed by the same physician.
df_with_date_range <- df %>%
mutate(date_range1 = date_start - 31,
date_range2 = date_start - 1)
df_semifinal <- df_with_date_range %>%
rowwise() %>%
mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id),
function(x, y, z) ungroup(filter(rowwise(df_with_date_range),
between(date_end, x, y),
hospital_id == z))$hospitalization_id)) %>%
unnest(hospital_id_in_range, keep_empty = TRUE)
df_final <- df_semifinal %>%
left_join(select(df, hospitalization_id, physician_id),
by = c('hospital_id_in_range' = 'hospitalization_id')) %>%
mutate(hospital_id_in_range = ifelse(physician_id.x == physician_id.y, NA, hospital_id_in_range)) %>%
select(-physician_id.y) %>%
rename(physician_id = physician_id.x) %>%
distinct()
I am trying to write a more efficient code given that my data is massive - ideally I would like to avoid just adding all hospitalizations and then dropping the ones performed by the given physician.
dplyrsolution or are other packages also ok? Usually,data.tableis better suited for large data sets.data.tableas well! The only issue is that the rest of my code usesdplyr- is it ok to mix them throughout the code or is this not recommended?rowwiseas well with no issue, which should also significantly speed up the process. You can try my edits if you want to.