Wrong variable comparison result when performing data.table merge of two tables with duplicated keys

Question

A colleague doing an analysis got some code from ChatGPT that gives wrong results, and I don't understand what it actually does.

Here is the example:

Let's consider a first table (drugs: each patient has an id and starts a drug at time x):

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

        id     x
    <char> <int>
 1:      A    10
 2:      A     8
 3:      A     8
 4:      B     3
 5:      B     9

Let's consider a second (and main) table (hospital visits: same patients, each with several hospital stays running from y1 to y2):

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

        id    y1    y2 eds_id
    <char> <num> <num>  <int>
 1:      A     2     6      1
 2:      A     4     8      2
 3:      B     2     6      3
 4:      B     4     8      4

Now I want to know, for each hospital stay, whether any drug was prescribed to the patient during the stay, i.e. whether x is between y1 and y2 for any drug.

I would do a non-equi join:

df2[df1,xinbetween_true := TRUE,on = .(id,y1 <= x, y2 >= x)]
df2[is.na(xinbetween_true),xinbetween_true := FALSE]

Which works.

ChatGPT came up with:

df2[df1,on = "id",xinbetween := x >= y1 & x <= y2]

Which produces wrong answers:

df2[xinbetween_true != xinbetween]

       id    y1    y2 eds_id xinbetween xinbetween_true
   <char> <num> <num>  <int>     <lgcl>          <lgcl>
1:      B     2     6      3      FALSE            TRUE
2:      C     4     8      6      FALSE            TRUE

For these two entries, the ChatGPT code says FALSE, even though some of the df1 entries satisfy the condition:

df2[df1,on = "id",allow.cartesian = T][xinbetween_true != xinbetween]


       id    y1    y2 eds_id xinbetween xinbetween_true     x
   <char> <num> <num>  <int>     <lgcl>          <lgcl> <int>
1:      B     2     6      3      FALSE            TRUE     3
2:      B     2     6      3      FALSE            TRUE     9
3:      B     2     6      3      FALSE            TRUE     9
4:      C     4     8      6      FALSE            TRUE     3
5:      C     4     8      6      FALSE            TRUE     4
6:      C     4     8      6      FALSE            TRUE     3

So here is my question:

What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do? It does not do a proper non-equi merge, but I don't get what it does.

And in what case can it be used?

I don't know if you are asking why ChatGPT came up with that code. But if so, there's no way we can answer that.
– Rui Barradas, Commented Jan 10 at 12:03
No, I don't care much about that. My question is (see the last lines of my post): What does the df2[df1,on = "id",xinbetween := x >= y1 & x <= y2] script do?
– denis, Commented Jan 10 at 12:10
You can try adding print statements: df2[df1,on = "id", xinbetween := {print(data.table(id, x, y1, y2, x >= y1 & x <= y2)); x >= y1 & x <= y2}]
– s_baldur, Commented Jan 10 at 12:50
You may consider editing the title to be more specific to the question, so others with the same question have an easier time finding it when searching.
– jpsmith, Commented Jan 10 at 13:10

Roland · Accepted Answer · 2025-01-10 12:50:50Z

3

It's important here that both data.tables have duplicated IDs. Thus, df2[df1, on = "id"] is a cartesian join:

df1[, rn := as.character(.I)]

df2[df1, on = "id", allow.cartesian = TRUE]
#        id    y1    y2 eds_id     x     rn
#    <char> <num> <num>  <int> <int> <char>
# 1:      A     2     6      1    10      1
# 2:      A     4     8      2    10      1
# 3:      A     2     6      1     8      2
# 4:      A     4     8      2     8      2
# 5:      A     2     6      1     8      3
# 6:      A     4     8      2     8      3
# 7:      B     2     6      3     3      4
# 8:      B     4     8      4     3      4
# 9:      B     2     6      3     9      5
#10:      B     4     8      4     9      5
#11:      B     2     6      3     9      6
#12:      B     4     8      4     9      6
#13:      C     2     6      5     3      7
#14:      C     4     8      6     3      7
#15:      C     2     6      5     4      8
#16:      C     4     8      6     4      8
#17:      C     2     6      5     3      9
#18:      C     4     8      6     3      9
#19:      D     2     6      7    10     10
#20:      D     4     8      8    10     10
#21:      D     2     6      7     7     11
#22:      D     4     8      8     7     11
#23:      D     2     6      7     5     12
#24:      D     4     8      8     5     12
#25:      E     2     6      9    10     13
#26:      E     4     8     10    10     13
#27:      E     2     6      9     7     14
#28:      E     4     8     10     7     14
#29:      E     2     6      9     6     15
#30:      E     4     8     10     6     15
#        id    y1    y2 eds_id     x     rn
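To make the row count above less mysterious, here is a small sketch that derives it from the number of duplicates per id. The x values are typed in from the printed join rather than sampled, and the helper names (n1, n2, counts) are mine, not part of the question.

```r
library(data.table)

# x values copied from the printed join above, so no RNG is involved
df1 <- data.table(id = rep(LETTERS[1:5], each = 3),
                  x  = c(10, 8, 8, 3, 9, 9, 3, 4, 3, 10, 7, 5, 10, 7, 6))
df2 <- data.table(id = rep(LETTERS[1:5], each = 2), y1 = c(2, 4), y2 = c(6, 8))

# Number of rows per id on each side of the join
n1 <- df1[, .(n_df1 = .N), by = id]
n2 <- df2[, .(n_df2 = .N), by = id]

# Every df1 row of an id is paired with every df2 row of the same id,
# so each id contributes n_df1 * n_df2 rows to df2[df1, on = "id"]
counts <- n1[n2, on = "id"]
counts[, n_joined := n_df1 * n_df2]

expected_rows <- counts[, sum(n_joined)]
actual_rows   <- nrow(df2[df1, on = "id", allow.cartesian = TRUE])
```

Both expected_rows and actual_rows should come out as 30 here (3 drug rows times 2 stays for each of the 5 patients).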

It helps to store the row numbers from df1 that are matched and used for the comparison:

library(data.table)
df1 <- data.table(id = rep(LETTERS[1:5],each = 3))
set.seed(125)
df1[,x := sample(1:10,.N,replace = T)]

df2 <- data.table(id = rep(LETTERS[1:5],each = 2),y1 = c(2,4),y2 = c(6,8))
# unique identifier
df2[,eds_id := 1:.N]

df1[, rn := as.character(.I)]
df2[df1,xinbetween_true := rn,on = .(id,y1 <= x, y2 >= x)]
df2[df1,xinbetween := fifelse(x >= y1 & x <= y2, rn, paste0(rn, "-")), on = "id"]
df2

#        id    y1    y2 eds_id xinbetween_true xinbetween
#    <char> <num> <num>  <int>          <char>     <char>
# 1:      A     2     6      1            <NA>         3-
# 2:      A     4     8      2               3          3
# 3:      B     2     6      3               4         6-
# 4:      B     4     8      4            <NA>         6-
# 5:      C     2     6      5               9          9
# 6:      C     4     8      6               8         9-
# 7:      D     2     6      7              12         12
# 8:      D     4     8      8              12         12
# 9:      E     2     6      9              15         15
#10:      E     4     8     10              15         15

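As you can see, the ChatGPT code effectively uses only the last row from df1 with a matching ID: the join produces several candidate values for each row of df2, and := keeps the last one that is assigned.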
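As a rough check of that reading, the sketch below compares the update join from the question with a version that explicitly keeps only the last df1 row per id, and then shows one possible way to get the intended "any drug during the stay" flag with an equi-join. The x values are typed in from the output above instead of sampled, and last_df1, any_hit and the xinbetween_last / xinbetween_any columns are names I made up for illustration.

```r
library(data.table)

# x values copied from the printed output above, so no RNG is involved
df1 <- data.table(id = rep(LETTERS[1:5], each = 3),
                  x  = c(10, 8, 8, 3, 9, 9, 3, 4, 3, 10, 7, 5, 10, 7, 6))
df2 <- data.table(id = rep(LETTERS[1:5], each = 2), y1 = c(2, 4), y2 = c(6, 8))
df2[, eds_id := 1:.N]

# Update join as in the question: only one value survives per df2 row
df2[df1, on = "id", xinbetween := x >= y1 & x <= y2]

# Same comparison, but using only the last df1 row of each id
last_df1 <- df1[, .(x = x[.N]), by = id]
df2[last_df1, on = "id", xinbetween_last := x >= y1 & x <= y2]

# Intended logic: compare on the full join, then collapse with any() per stay
any_hit <- df2[df1, on = "id", allow.cartesian = TRUE][
  , .(xinbetween_any = any(x >= y1 & x <= y2)), by = eds_id]
df2[any_hit, on = "eds_id", xinbetween_any := i.xinbetween_any]

# Stays where the question's update join disagrees with the any() version
df2[xinbetween != xinbetween_any, eds_id]
```

With this data, xinbetween and xinbetween_last agree on every row, and the only disagreements with xinbetween_any are eds_id 3 and 6, the same two stays flagged in the question. The non-equi update join from the question is still the more direct way to get the right answer; the any() variant is only there to show what the equi-join would need.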

3 Comments

denis Jan 10 at 12:58

Thanks for the explanation and the tricks! I don't get the logic of it, though. df2[df1,xinbetween_true := rn,on = .(id,y1 <= x, y2 >= x),allow.cartesian = T] does the same. I was also surprised that you don't need to use i.x in the := statement in the merge.

Roland Jan 10 at 13:44

If it is not ambiguous, you can omit i. from column names.
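A small sketch of that point, with the x values typed in from the answer's output and two made-up column names (unprefixed, prefixed): since x exists only in df1, writing x or i.x in j refers to the same column.

```r
library(data.table)

# x values copied from the answer's printed output, so no RNG is involved
df1 <- data.table(id = rep(LETTERS[1:5], each = 3),
                  x  = c(10, 8, 8, 3, 9, 9, 3, 4, 3, 10, 7, 5, 10, 7, 6))
df2 <- data.table(id = rep(LETTERS[1:5], each = 2), y1 = c(2, 4), y2 = c(6, 8))

# x is not a column of df2, so the bare name is looked up in df1 (the i table)
df2[df1, on = "id", unprefixed := x >= y1 & x <= y2]

# Same expression with the explicit i. prefix
df2[df1, on = "id", prefixed := i.x >= y1 & i.x <= y2]

same <- identical(df2$unprefixed, df2$prefixed)
```

Here same should be TRUE. As far as I understand, the prefix only starts to matter when both tables have a non-key column with the same name; then the bare name refers to the column of the outer table (df2 here) and i. is needed to reach the one from df1.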

Roland Jan 10 at 13:46

The allow.cartesian is a red herring. It doesn't change the result. To protect users, cartesian joins (except if the result isn't much larger than the input) are usually not computed by data.table. The argument turns that protection off.
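To illustrate, here is a sketch (x values typed in from the answer's output; a, b, joined_rows and plain are names I picked) that runs the update join with and without allow.cartesian and compares the results. It also tries the plain join without the argument inside tryCatch, so that it runs either way, whether or not the protection kicks in on your data.table version.

```r
library(data.table)

# x values copied from the answer's printed output, so no RNG is involved
df1 <- data.table(id = rep(LETTERS[1:5], each = 3),
                  x  = c(10, 8, 8, 3, 9, 9, 3, 4, 3, 10, 7, 5, 10, 7, 6))
df2 <- data.table(id = rep(LETTERS[1:5], each = 2), y1 = c(2, 4), y2 = c(6, 8))

# Update join on copies, once without and once with allow.cartesian
a <- copy(df2)[df1, on = "id", hit := x >= y1 & x <= y2]
b <- copy(df2)[df1, on = "id", hit := x >= y1 & x <= y2, allow.cartesian = TRUE]
same_result <- identical(a$hit, b$hit)

# With the protection turned off, the full join is materialised
joined_rows <- nrow(df2[df1, on = "id", allow.cartesian = TRUE])

# Without the argument the plain join may be refused; capture the outcome
# (either a row count or an error message) instead of assuming it
plain <- tryCatch(nrow(df2[df1, on = "id"]),
                  error = function(e) conditionMessage(e))
```

same_result should be TRUE and joined_rows should be 30, which matches the comment above: the argument controls whether the big join is allowed to run, not which values end up in the column.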
