I have two data.tables that I'm trying to merge. Because the data are confidential, there are no identifier variables, so I need a combination of several variables to match rows between the two datasets without creating duplicates.

I tried to join them, but in the final dataset the variable I wanted to add is empty: all of its values are NULL. data1 has 17440 observations and 57 variables; old_data has 17347 observations and 12 variables. I need 11 variables to identify observations uniquely; let's call them key_variables. Here's what I have:

key_variables <- c("sex", "birthdate", "sint", "cons", "diag", "concelho", "Serologia", "alcohol", "end", "micro")

setkeyv(data1, key_variables)
setkeyv(old_data, key_variables)

dataFinal <- merge(data1, old_data, by = key_variables, all.x = TRUE)

The variable I'm trying to add to data1 is a factor. I tried converting it to character, but the variable is still set to NULL. Any idea what could be causing this?

str(old_data)
Classes ‘data.table’ and 'data.frame':  17347 obs. of  12 variables:
 $ sex            : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
  ..- attr(*, "llevels")= int  1 2
  ..- attr(*, "label")= chr "Sex"
 $ birthdate      : labelled, format: NA NA ...
 $ diagnosis_date : labelled, format: "2009-01-09" "2009-10-15" ...
 $ county         : Factor w/ 300 levels "Lisboa","Sines",..: 23 62 244 34 18 37 1 27 60 66 ...
  ..- attr(*, "llevels")= int  11 1 2 3 4 5 6 7 8 9 ...
  ..- attr(*, "label")= chr "County"

str(data1)
Classes ‘data.table’ and 'data.frame':  17440 obs. of  57 variables:
  $ ID               : chr  "12083" "12084" "12087" "12096" ...
  $ sex              : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
   ..- attr(*, "llevels")= int  1 2
  $ birthdate        : Date, format: NA NA ...
  $ county           : Factor w/ 300 levels "Lisboa","Sines",..: 17 17 50 235 25 84 28 1 20 47 ...
   ..- attr(*, "llevels")= int  10 1 2 3 4 5 6 7 8 9 ..


dput(data1)
structure(list(sex = c("Masculino", "Masculino", "Masculino"), 
birthdate = c("4/23/1952", "11/26/1964", "01/08/1965"), sint = c("01/01/2014", 
"09/01/2010", "01/01/2008"), cons = c("02/10/2014", "12/01/2010", 
"1/29/2008"), diag = c("02/10/2014", "12/03/2010", "02/03/2008"
), concelho = c("vila velha de ródão", "vila velha de ródão", 
"vila velha de ródão"), Serologia = c("Não", "Não", "Não"
), alcohol = c("Sim", "Não", "Sim"), end = c("11/03/2014", 
"10/10/2011", "9/17/2008"), micro = c("03/11/2008", "12/03/2010", 
"02/03/2008"), DInflamatoriaArticular = c("Não", "Não", "Não"
)), row.names = c(NA, -3L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000001f2af621ef0>)

dput(old_data)
structure(list(sex = c("Masculino", "Masculino", "Masculino"), 
birthdate = c("23/04/1952", "26/11/1964", "08/01/1965"), 
age = c(61L, 46L, 43L), concelho = c("vila velha de ródão", 
"vila velha de ródão", "vila velha de ródão"), EstadoVital = c("Vivo", 
"Vivo", "Vivo"), sint = c("01/01/2014", "01/09/2010", "01/01/2008"
), cons = c("10/02/2014", "01/12/2010", "29/01/2008"), alcohol = c("Sim", 
"Não", "Sim"), drugs = c("Não", "Não", "Não"), micro = c("11/03/2008", 
"03/12/2010", "03/02/2008"), diag = c("10/02/2014", "03/12/2010", 
"03/02/2008"), Serologia = c("Não", "Não", "Não"), end = c("03/11/2014", 
"10/10/2011", "17/09/2008"), Motivotermotratamento = c("Tratamento Completado", 
"Tratamento Completado", "Tratamento Completado"), ano = c(2014L, 
2010L, 2008L), region = c("Centro", "Centro", "Centro")), row.names = c(NA,-3L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x000001f2af621ef0>)
  • Are you sure that the content of the key_variables of both data.tables are matching? I'd check with e.g.: unique(data1[[key_variables[1]]]) %in% unique(old_data[[key_variables[1]]]). Maybe there is a small difference in the content preventing the join? Commented Oct 22, 2019 at 11:55
  • Yes, the output is TRUE TRUE. Comparing the data.tables, one difference is the labels: one data.table has variable labels and the other doesn't. I can't see why that would be an issue, though. Commented Oct 22, 2019 at 12:11
  • Is it TRUE for all key_variables? Do you have the possibility to create a dummy dataset to reproduce the problem? Commented Oct 22, 2019 at 12:14
  • I added the structure of the data.tables. old_data has an attribute which is the variable label. Now that I'm looking at the structures side by side, the problem could be differences in factor levels, right? county has 300 levels, if the levels don't match there's no merge? Commented Oct 22, 2019 at 13:00
  • Yes, I guess you have to align the structure of your factor variables. Commented Oct 22, 2019 at 13:33

1 Answer


As you have already mentioned in the comments, the date formats of the two tables differ: data1 stores dates as month/day/year, while old_data uses day/month/year. Here is one way to align them:

library(data.table)

key_variables <-
  c(
    "sex",
    "birthdate",
    "sint",
    "cons",
    "diag",
    "concelho",
    "Serologia",
    "alcohol",
    "end",
    "micro"
  )

data1 <-
  structure(
    list(
      sex = c("Masculino", "Masculino", "Masculino"),
      birthdate = c("4/23/1952", "11/26/1964", "01/08/1965"),
      sint = c("01/01/2014",
               "09/01/2010", "01/01/2008"),
      cons = c("02/10/2014", "12/01/2010",
               "1/29/2008"),
      diag = c("02/10/2014", "12/03/2010", "02/03/2008"),
      concelho = c("vila velha de ródão", "vila velha de ródão",
                   "vila velha de ródão"),
      Serologia = c("Não", "Não", "Não"),
      alcohol = c("Sim", "Não", "Sim"),
      end = c("11/03/2014",
              "10/10/2011", "9/17/2008"),
      micro = c("03/11/2008", "12/03/2010",
                "02/03/2008"),
      DInflamatoriaArticular = c("Não", "Não", "Não")
    ),
    row.names = c(NA,-3L),
    class = c("data.table", "data.frame")
  )

old_data <-
  structure(
    list(
      sex = c("Masculino", "Masculino", "Masculino"),
      birthdate = c("23/04/1952", "26/11/1964", "08/01/1965"),
      age = c(61L, 46L, 43L),
      concelho = c("vila velha de ródão",
                   "vila velha de ródão", "vila velha de ródão"),
      EstadoVital = c("Vivo",
                      "Vivo", "Vivo"),
      sint = c("01/01/2014", "01/09/2010", "01/01/2008"),
      cons = c("10/02/2014", "01/12/2010", "29/01/2008"),
      alcohol = c("Sim",
                  "Não", "Sim"),
      drugs = c("Não", "Não", "Não"),
      micro = c("11/03/2008",
                "03/12/2010", "03/02/2008"),
      diag = c("10/02/2014", "03/12/2010",
               "03/02/2008"),
      Serologia = c("Não", "Não", "Não"),
      end = c("03/11/2014",
              "10/10/2011", "17/09/2008"),
      Motivotermotratamento = c(
        "Tratamento Completado",
        "Tratamento Completado",
        "Tratamento Completado"
      ),
      ano = c(2014L,
              2010L, 2008L),
      region = c("Centro", "Centro", "Centro")
    ),
    row.names = c(NA, -3L),
    class = c("data.table", "data.frame")
  )

# Columns holding dates as text: month/day/year in data1, day/month/year in old_data
date_cols <- c("birthdate", "sint", "cons", "diag", "end", "micro")

# Parse both tables to Date so the key columns are comparable
data1[, (date_cols) := lapply(.SD, as.Date, format = "%m/%d/%Y"), .SDcols = date_cols]
old_data[, (date_cols) := lapply(.SD, as.Date, format = "%d/%m/%Y"), .SDcols = date_cols]

# Set keys after the conversion, because updating a key column with := invalidates the key
setkeyv(data1, key_variables)
setkeyv(old_data, key_variables)

dataFinal <- merge(data1, old_data, by = key_variables)
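
To see which rows still fail to match after a merge like the one above, an anti-join may help. The following is only a minimal sketch: the tables `a` and `b` and the vector `keys` are hypothetical stand-ins for data1, old_data and key_variables, kept tiny so the block runs on its own.

```r
library(data.table)

# Hypothetical toy tables standing in for data1 and old_data
a <- data.table(
  sex       = c("M", "F"),
  birthdate = as.Date(c("1952-04-23", "1964-11-26")),
  x         = 1:2
)
b <- data.table(
  sex       = c("M", "F"),
  birthdate = as.Date(c("1952-04-23", "1970-01-01")),
  y         = c("p", "q")
)
keys <- c("sex", "birthdate")

# Anti-join: rows of `a` that have no partner in `b` on the key columns
unmatched <- a[!b, on = keys]

# Left join: keeps every row of `a`; `y` is NA where no match was found
joined <- merge(a, b, by = keys, all.x = TRUE)
```

If `unmatched` is not empty on the real data, inspecting its key columns is usually the quickest way to spot remaining format differences.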

1 Comment

Thank you! That's exactly what I've just done. Btw, I didn't lie, this was TRUE when I did unique(data1[[key_variables[2]]]) %in% unique(old_data[[key_variables[2]]]), with the different data structures. I guess with the rush I forgot to check the date formats and whether there was a match between the datasets =P silly mistake! Thank you for help!
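
A minimal sketch of why a per-column `%in%` check can come back TRUE while the join still matches nothing: with ambiguous day/month strings, every value can appear in the other table's column without any full row agreeing. The tables `a` and `b` and the vector `keys` below are hypothetical and only illustrate the effect.

```r
library(data.table)

# Hypothetical tables: each column holds the same set of strings in both,
# but no complete row of `a` appears in `b`
a <- data.table(
  d1 = c("01/02/2010", "02/01/2010"),
  d2 = c("03/04/2011", "04/03/2011")
)
b <- data.table(
  d1 = c("02/01/2010", "01/02/2010"),
  d2 = c("03/04/2011", "04/03/2011")
)
keys <- c("d1", "d2")

# Column-by-column membership check: passes for every key column
col_ok <- sapply(keys, function(k) all(a[[k]] %in% b[[k]]))

# Row-wise check via an inner join on all keys: finds no matching rows
n_matched <- nrow(merge(a, b, by = keys))
```

So a column-wise check is a useful first filter, but only a row-wise comparison on all key columns together shows whether the join can succeed.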
