I have two data.tables that I'm trying to merge. However, the rows in these data.tables need a large number of variables to be identified without duplicates. For confidentiality reasons the data have no identifier variable, so I have to match the two datasets on a combination of several variables.
I tried to join them, but when I look at the final dataset the added variable is empty: all of its values are set to NULL. data1 has 17440 observations and 57 variables. old_data has 17347 observations and 12 variables. I need 11 variables to get unique observations; let's call them key_variables. Here's what I have:
key_variables <- c("sex", "birthdate", "sint", "cons", "diag", "concelho", "Serologia", "alcohol", "end", "micro")
setkeyv(data1, key_variables)
setkeyv(old_data, key_variables)
dataFinal <- merge(data1, old_data, by = key_variables, all.x = TRUE)
The variable I'm trying to add to data1 is a factor. I tried converting it to character, but the variable is still set to NULL. Any idea what could be causing this?
str(old_data)
Classes ‘data.table’ and 'data.frame': 17347 obs. of 12 variables:
$ sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "llevels")= int 1 2
..- attr(*, "label")= chr "Sex"
$ birthdate : labelled, format: NA NA ...
$ diagnosis_date : labelled, format: "2009-01-09" "2009-10-15" ...
$ county : Factor w/ 300 levels "Lisboa","Sines",..: 23 62 244 34 18 37 1 27 60 66 ...
..- attr(*, "llevels")= int 11 1 2 3 4 5 6 7 8 9 ...
..- attr(*, "label")= chr "County"
str(data)
Classes ‘data.table’ and 'data.frame': 17440 obs. of 57 variables:
$ ID : chr "12083" "12084" "12087" "12096" ...
$ sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ...
..- attr(*, "llevels")= int 1 2
$ birthdate : Date, format: NA NA ...
$ county : Factor w/ 300 levels "Lisboa","Sines",..: 17 17 50 235 25 84 28 1 20 47 ...
..- attr(*, "llevels")= int 10 1 2 3 4 5 6 7 8 9 ..
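One thing the two str() outputs suggest is that the same column can have a different class in each table (birthdate is "labelled" in one and Date in the other). A minimal sketch of how the key columns could be compared class by class; the tables `a` and `b` and the vector `keys` are hypothetical toy stand-ins, not the real data:

```r
library(data.table)

# Hypothetical toy tables: same key names, but birthdate stored differently
a <- data.table(sex = factor("Male"), birthdate = as.Date("1952-04-23"))
b <- data.table(sex = factor("Male"), birthdate = "1952-04-23")

keys <- c("sex", "birthdate")

# First class of each key column in each table, side by side
class_cmp <- data.table(
  variable = keys,
  class_a  = sapply(keys, function(k) class(a[[k]])[1]),
  class_b  = sapply(keys, function(k) class(b[[k]])[1])
)

# Keys whose classes differ are the first candidates for a failed match
mismatched <- class_cmp[class_a != class_b]
print(mismatched)
```

On the real data the same comparison would use data1, old_data and key_variables in place of the toy objects.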
dput(data1)
structure(list(sex = c("Masculino", "Masculino", "Masculino"),
birthdate = c("4/23/1952", "11/26/1964", "01/08/1965"), sint = c("01/01/2014",
"09/01/2010", "01/01/2008"), cons = c("02/10/2014", "12/01/2010",
"1/29/2008"), diag = c("02/10/2014", "12/03/2010", "02/03/2008"
), concelho = c("vila velha de ródão", "vila velha de ródão",
"vila velha de ródão"), Serologia = c("Não", "Não", "Não"
), alcohol = c("Sim", "Não", "Sim"), end = c("11/03/2014",
"10/10/2011", "9/17/2008"), micro = c("03/11/2008", "12/03/2010",
"02/03/2008"), DInflamatoriaArticular = c("Não", "Não", "Não"
)), row.names = c(NA, -3L), class = c("data.table", "data.frame"))
dput(old_data)
structure(list(sex = c("Masculino", "Masculino", "Masculino"),
birthdate = c("23/04/1952", "26/11/1964", "08/01/1965"),
age = c(61L, 46L, 43L), concelho = c("vila velha de ródão",
"vila velha de ródão", "vila velha de ródão"), EstadoVital = c("Vivo",
"Vivo", "Vivo"), sint = c("01/01/2014", "01/09/2010", "01/01/2008"
), cons = c("10/02/2014", "01/12/2010", "29/01/2008"), alcohol = c("Sim",
"Não", "Sim"), drugs = c("Não", "Não", "Não"), micro = c("11/03/2008",
"03/12/2010", "03/02/2008"), diag = c("10/02/2014", "03/12/2010",
"03/02/2008"), Serologia = c("Não", "Não", "Não"), end = c("03/11/2014",
"10/10/2011", "17/09/2008"), Motivotermotratamento = c("Tratamento Completado",
"Tratamento Completado", "Tratamento Completado"), ano = c(2014L,
2010L, 2008L), region = c("Centro", "Centro", "Centro")), row.names = c(NA,-3L), class = c("data.table", "data.frame"))
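In the two samples above the date strings appear to use different layouts: month/day/year in data1 ("4/23/1952") and day/month/year in old_data ("23/04/1952"). If that also holds for the full data, the keys never match as raw strings. A minimal sketch, assuming those two formats and using only two columns copied from the samples (`d1`, `old`, `raw` and `fixed` are illustrative names):

```r
library(data.table)

# Two rows and a subset of columns taken from the dput() samples
d1  <- data.table(birthdate = c("4/23/1952", "11/26/1964"),
                  alcohol   = c("Sim", "Não"))
old <- data.table(birthdate = c("23/04/1952", "26/11/1964"),
                  alcohol   = c("Sim", "Não"),
                  EstadoVital = c("Vivo", "Vivo"))

# Merging on the raw strings: no key matches, so the added column is all NA
raw <- merge(d1, old, by = c("birthdate", "alcohol"), all.x = TRUE)

# Parse each table with its own (assumed) format so both sides become Date
d1[,  birthdate := as.Date(birthdate, format = "%m/%d/%Y")]
old[, birthdate := as.Date(birthdate, format = "%d/%m/%Y")]

# After normalising, the same merge finds the matching rows
fixed <- merge(d1, old, by = c("birthdate", "alcohol"), all.x = TRUE)
print(fixed)
```

The same conversion would have to be applied to every date column among the key variables (sint, cons, diag, end, micro) before merging.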
key_variables of both data.tables are matching? I'd check with e.g.: unique(data1[[key_variables[1]]]) %in% unique(old_data[[key_variables[1]]]). Maybe there is a small difference in the content preventing the join?
TRUE for all key_variables? Do you have the possibility to create a dummy dataset to reproduce the problem?
old_data has an attribute which is the variable label. Now that I'm looking at the structures side by side, the problem could be differences in factor levels, right?
county has 300 levels, if the levels don't match there's no merge?
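The %in% check suggested in the comments can be extended from the first key to all of them. A minimal sketch on hypothetical toy tables (`x`, `y` and `keys` are stand-ins for data1, old_data and key_variables); values are compared as character so that factor level order and extra attributes play no role in the comparison:

```r
library(data.table)

# Hypothetical toy tables: county differs only by capitalisation in one row
x <- data.table(sex    = factor(c("Male", "Female")),
                county = factor(c("Lisboa", "Sines")))
y <- data.table(sex    = factor(c("Male", "Female")),
                county = factor(c("Lisboa", "sines")))
keys <- c("sex", "county")

# Share of distinct values in x that also occur in y, per key column
match_share <- sapply(keys, function(k) {
  mean(unique(as.character(x[[k]])) %in% unique(as.character(y[[k]])))
})
print(match_share)
```

A key with a share well below 1 is the one to inspect first, since a small difference in content (capitalisation, date layout, trailing spaces) is enough to prevent the join.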