0

A collegue trying to do analysis came up with a code from chatgpt, doing something wrong, but that I don't understand.

Here is the example:

Let's consider a first table ( drugs: Patient have an id, and start a drug at x):

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

        id     x
    <char> <int>
 1:      A    10
 2:      A     8
 3:      A     8
 4:      B     3
 5:      B     9

Let's consider a second (and main) table (hospital visits, same patients, several hospital stays between two dates y1 and y2) :

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

        id    y1    y2 eds_id
    <char> <num> <num>  <int>
 1:      A     2     6      1
 2:      A     4     8      2
 3:      B     2     6      3
 4:      B     4     8      4

Now I want, for each hospital stay, know if any drug was prescribed to the patient during the stay, aka x between y1 and y2, for any drug.

I would do non-equi merge:

df2[df1,xinbetween_true := TRUE,on = .(id,y1 <= x, y2 >= x)]
df2[is.na(xinbetween_true),xinbetween_true := FALSE]

Which work.

ChatGPT came up with:

df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]

Which produce wrong answers:

df2[xinbetween_true != xinbetween]

       id    y1    y2 eds_id xinbetween xinbetween_true
   <char> <num> <num>  <int>     <lgcl>          <lgcl>
1:      B     2     6      3      FALSE            TRUE
2:      C     4     8      6      FALSE            TRUE

For these two entries, the ChatGPT script says no, when it actually has some of the df1 entries respecting the condition:

df2[df1,on = "id",allow.cartesian = T][xinbetween_true != xinbetween]


       id    y1    y2 eds_id xinbetween xinbetween_true     x
   <char> <num> <num>  <int>     <lgcl>          <lgcl> <int>
1:      B     2     6      3      FALSE            TRUE     3
2:      B     2     6      3      FALSE            TRUE     9
3:      B     2     6      3      FALSE            TRUE     9
4:      C     4     8      6      FALSE            TRUE     3
5:      C     4     8      6      FALSE            TRUE     4
6:      C     4     8      6      FALSE            TRUE     3

So is here my question:

What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do? It does not do a proper non-equi merge, but I don't get what it does.

And in what case can it be used?

6
  • I don't know if you are asking why did ChatGPT came up with that code. But if yes, then there's no way we can answer. Commented Jan 10 at 12:03
  • No, I don't care much. My question is (see last lines of my post): What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do? Commented Jan 10 at 12:10
  • 1
    You can try adding print statements: df2[df1,on = "id", xinbetween := {print(data.table(id, x, y1, y2, x >= y1 & x <= y2)); x >= y1 & x <= y2}] Commented Jan 10 at 12:50
  • @s_baldur that's clever, thanks! Commented Jan 10 at 12:55
  • 1
    You may consider editing the title to be more specific to the question so others with the same question have an easier time finding it when searching Commented Jan 10 at 13:10

1 Answer 1

3

It's important here that both data.tables have duplicated IDs. Thus, df2[df1, on = "id"] is a cartesian join:

df1[, rn := as.character(.I)]

df2[df1, on = "id", allow.cartesian = TRUE]
#        id    y1    y2 eds_id     x     rn
#    <char> <num> <num>  <int> <int> <char>
# 1:      A     2     6      1    10      1
# 2:      A     4     8      2    10      1
# 3:      A     2     6      1     8      2
# 4:      A     4     8      2     8      2
# 5:      A     2     6      1     8      3
# 6:      A     4     8      2     8      3
# 7:      B     2     6      3     3      4
# 8:      B     4     8      4     3      4
# 9:      B     2     6      3     9      5
#10:      B     4     8      4     9      5
#11:      B     2     6      3     9      6
#12:      B     4     8      4     9      6
#13:      C     2     6      5     3      7
#14:      C     4     8      6     3      7
#15:      C     2     6      5     4      8
#16:      C     4     8      6     4      8
#17:      C     2     6      5     3      9
#18:      C     4     8      6     3      9
#19:      D     2     6      7    10     10
#20:      D     4     8      8    10     10
#21:      D     2     6      7     7     11
#22:      D     4     8      8     7     11
#23:      D     2     6      7     5     12
#24:      D     4     8      8     5     12
#25:      E     2     6      9    10     13
#26:      E     4     8     10    10     13
#27:      E     2     6      9     7     14
#28:      E     4     8     10     7     14
#29:      E     2     6      9     6     15
#30:      E     4     8     10     6     15
#        id    y1    y2 eds_id     x     rn

It should be elucidating to store the row numbers from df1 that match/are used for the comparison:

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

df1[, rn := as.character(.I)]
df2[df1,xinbetween_true := rn,on = .(id,y1 <= x, y2 >= x)]
df2[df1,xinbetween := fifelse(x >= y1 & x <= y2, rn, paste0(rn, "-")), on = "id"]

#        id    y1    y2 eds_id xinbetween_true xinbetween
#    <char> <num> <num>  <int>          <char>     <char>
# 1:      A     2     6      1            <NA>         3-
# 2:      A     4     8      2               3          3
# 3:      B     2     6      3               4         6-
# 4:      B     4     8      4            <NA>         6-
# 5:      C     2     6      5               9          9
# 6:      C     4     8      6               8         9-
# 7:      D     2     6      7              12         12
# 8:      D     4     8      8              12         12
# 9:      E     2     6      9              15         15
#10:      E     4     8     10              15         15

As you see, the ChatGPT code uses the last row from df1 with a matching ID.

Sign up to request clarification or add additional context in comments.

3 Comments

Thanks for explanation and tricks! I don't get the logic of it though. df2[df1,xinbetween_true := rn,on = .(id,y1 <= x, y2 >= x),allow.cartesian = T] does the same. I was also surprise you don't need to use i.x in the := statement in the merge.
If it isnot ambiguous you can omit i. from column names.
The allow.cartesian is a red herring. It doesn't change the result. To protect users, cartesian joins (except if the result isn't much larger than the input) are usually not computed by data.table. The argument turns that protection off.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.