1

I have a dataset where each row is identified by a hospitalization id. Each row contains information on the hospitalization id, the dates of admission and discharge, as well as the identification of the hospital where it took place and the physician responsible for it.

I would like to know, for each hospitalization, the id of all other hospitalizations concluded in the 30 days before the beginning of the given hospitalization that were performed by other physicians in the same hospital.

Below is a simple example of 8 hospitalizations performed by 2 physicians in 2 hospitals (physicians may work in more than one hospital).

library("tidyverse")
df <- data.frame(hospitalization_id = c(1, 2, 3, 
                                        4, 5, 
                                        6, 7, 8),
                 hospital_id = c("A", "A", "A", 
                                 "A", "A",
                                 "B", "B", "B"),
                 physician_id = c(1, 1, 1, 
                                  2, 2, 
                                  2, 2, 2),
                 date_start = as.Date(c("2000-01-01", "2000-01-12", "2000-01-20",
                                        "2000-01-12", "2000-01-20",
                                        "2000-02-10", "2000-02-11", "2000-02-12")),
                 date_end = as.Date(c("2000-01-03", "2000-01-18", "2000-01-22",
                                      "2000-01-18", "2000-01-22",
                                      "2000-02-11", "2000-02-14", "2000-02-17")))

Using the solution posted in R (dplyr): find all rows in row-specific range: first, I find all other hospitalizations during the 30-day period prior to the start of the given hospitalization in the given hospital; then, I drop those hospitalizations that were performed by the same physician.

df_with_date_range <- df %>%
  mutate(date_range1 = date_start - 31,
         date_range2 = date_start - 1)

df_semifinal <- df_with_date_range %>%
  rowwise() %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id),
                                     function(x, y, z) ungroup(filter(rowwise(df_with_date_range),
                                                                      between(date_end, x, y),
                                                                      hospital_id == z))$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

df_final <- df_semifinal %>% 
  left_join(select(df, hospitalization_id, physician_id),
            by = c('hospital_id_in_range' = 'hospitalization_id')) %>%
  mutate(hospital_id_in_range = ifelse(physician_id.x == physician_id.y, NA, hospital_id_in_range)) %>%
  select(-physician_id.y) %>%
  rename(physician_id = physician_id.x) %>%
  distinct()

I am trying to write a more efficient code given that my data is massive - ideally I would like to avoid just adding all hospitalizations and then dropping the ones performed by the given physician.

3
  • Do you want a dplyr solution or are other packages also ok? Usually, data.table is better suited for large data sets. Commented Aug 15, 2022 at 19:41
  • Thank you @starja - it would be great if you could suggest a code with data.table as well! The only issue is that the rest of my code uses dplyr - is it ok to mix them throughout the code or is this not recommended? Commented Aug 15, 2022 at 22:44
  • 1
    Also, with my solution, I realized that you can eliminate the first rowwise as well with no issue, which should also significantly speed up the process. You can try my edits if you want to. Commented Aug 16, 2022 at 3:03

1 Answer 1

1

What about just adding another filtering criteria for physician in my previous solution as such:

df_with_date_range %>%
  mutate(hospital_id_in_range = pmap(list(date_range1, date_range2, hospital_id, physician_id),
                   function(x, y, z, p) filter(df_with_date_range,
                                                 date_start >= x & date_start <= y,
                                                 hospital_id == z,
                                                 physician_id != p)$hospitalization_id)) %>%
  unnest(hospital_id_in_range, keep_empty = TRUE)

# # A tibble: 9 × 8
#   hospitalization_id hospital_id physician_id date_start date_end   date_range1 date_range2 hospital_id_in_range
#                <dbl> <chr>              <dbl> <date>     <date>     <date>      <date>                     <dbl>
# 1                  1 A                      1 2000-01-01 2000-01-03 1999-12-01  1999-12-31                    NA
# 2                  2 A                      1 2000-01-12 2000-01-18 1999-12-12  2000-01-11                    NA
# 3                  3 A                      1 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     4
# 4                  4 A                      2 2000-01-12 2000-01-18 1999-12-12  2000-01-11                     1
# 5                  5 A                      2 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     1
# 6                  5 A                      2 2000-01-20 2000-01-22 1999-12-20  2000-01-19                     2
# 7                  6 B                      2 2000-02-10 2000-02-11 2000-01-10  2000-02-09                    NA
# 8                  7 B                      2 2000-02-11 2000-02-14 2000-01-11  2000-02-10                    NA
# 9                  8 B                      2 2000-02-12 2000-02-17 2000-01-12  2000-02-11                    NA

Also, I switched the between to a better vectorized implementation that does not require a second rowwise. Using more rowwise tends to slow the procedure down. This one should be a bit faster.

The expected output might be a bit different than yours because I filter during the procedure instead of replacing with NAs afterwards. If you want to replace physician copies with NAs, let me know and I can code a mutate of the filtered data frame to do that.

Sign up to request clarification or add additional context in comments.

7 Comments

there seems to be no rowwise in your code any longer. You said you eliminated the second rowwise but I also don't see it anywhere in your code. Just checking :) Thank you!
Yes, I got rid of the rowwise as it was actually unnecessary, this is why you do not see it after my edits. The rowwise was not needed because of the pmap and the code should run much faster this way. Let me know if you have any further questions!
Thank you!! Now I am trying to sort out another complication from my data: the same hospitalization may involve more than one physician. I would only like to add the other hospitalizations that do not involve the given physician. E.g. for instance, if I am a physician who have worked in the past 30 days with someone else in another hospitalization, I do not want this hospitalization to be added to the dataset. I would like only hospitalizations that did not involve me to be added to the dataset.
I have updated the example and managed to write a code where I find (i) a list with all the hospitalizations in the past 30 days in the hospital, (ii) a list of hospitalizations performed in the past 30 days by the given physician. Then, I take the elements in list (i) which are not in (ii). The only issue is that a same hospitalization shows up multiple times in list (i) in case it involves multiple physicians. Could I share my code with you? If so, where should I post it: here or do I create a new thread?
I'd be happy to take a look. Just make a new thread, as it is a different question. Make sure to highlight where the current code goes wrong, and what the expected output is.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.