data.table assignment by reference modifies wrong object

Question

I experience some unexpected behavior when using grouped modification of a column in a data.table:

library(data.table)

# creating a data.frame
data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1

# copying data to data_temp
data_temp <- data

# assigning some random value to data_temp so that it should no longer be a
# copy of "data"
data_temp[1, "random_value"] <- rnorm(1)

# converting data_temp to data.table
setDT(data_temp)

# expanding trim parameter to group and subsetting
data_temp <- data_temp[, trim := sum(trim), by = sequence][trim == 0]

data_temp comes out as expected with only the "C" sequence entries remaining. However, I would also expect the "data" object to remain unchanged. This is not the case. The "data" object looks as follows:

   sequence trim random_value
1         A    2           NA
2         A    2           NA
3         B    2           NA
4         B    2           NA
5         B    2           NA
6         C    0           NA
7         C    0           NA
8         C    0           NA
9         D    1           NA
10        D    1           NA

So the assignment by reference of the "trim" variable also happened in the original data.frame.

I am using data.table_1.11.4 and R version 3.4.3 for compatibility reasons.

Is the error a result of using old versions or am I doing something wrong / do I need to change the code to avoid that error?

Ah thanks. Good to know that copy() is also necessary when the objects I copy are not yet data.table objects but data.frames, only one of which becomes a data.table later.
– Phil, Commented Jul 28, 2019 at 19:08
@Roland I was surprised to see that data_temp[1, "random_value"] <- rnorm(1) does not copy the entire data.frame, but only the "random_value" vector. So, after this line, the sequence and trim variables of the separate data.frames still point to the same objects in memory. I verified this with .Internal(inspect(.)). I wonder how long this behavior has been the default in base R. Maybe since lists were allowed to hold pointers?
– lmo, Commented Jul 28, 2019 at 19:19
@David, it is unclear that this is a duplicate question. Although the advice to "create a copy before doing anything" will solve both issues, the copying behavior of <- differs for data.frame and data.table objects. You can see this by repeating Matt Dowle's example with data.frames and inspecting the memory location of the vectors. This would more accurately mirror the above situation.
– lmo, Commented Jul 29, 2019 at 14:38

Phil · Accepted Answer · 2019-07-28 19:40:05Z

As @Roland kindly pointed out in a comment on the question, data.table needs an explicit "copy()" to make an independent copy of an object. A plain data_temp <- data does not duplicate anything, and a later base R assignment only duplicates the column it changes. As @lmo checked, only columns that are changed in just one of the two data.frames and not by reference (e.g. "random_value" in the example) are actually copied / unlinked. The remaining columns still share memory, so setDT() and := modify them in both objects.
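The sharing of columns can be inspected directly. Here is a minimal sketch, assuming data.table is installed, that uses data.table's address() helper instead of the .Internal(inspect(.)) call mentioned in the comments. The smaller example data and the fixed value 0.5 (in place of rnorm(1)) are only for illustration:

```r
library(data.table)

# small stand-in for the data.frame from the question
data <- data.frame(sequence = rep(c("A", "B"), c(2, 2)), trim = 0, random_value = NA)

# plain assignment: no data is copied at this point
data_temp <- data

# base R copy-on-modify: only the modified column gets new memory
data_temp[1, "random_value"] <- 0.5

# address() returns the memory address of an object as a string
shared_trim   <- address(data$trim) == address(data_temp$trim)
shared_random <- address(data$random_value) == address(data_temp$random_value)

print(shared_trim)    # whether the untouched "trim" column is still shared
print(shared_random)  # whether the modified "random_value" column is shared
```

According to the check reported in the comments, the untouched columns are still shared after the base R assignment, which is why a later := on data_temp also shows up in data. The modified "random_value" column has its own memory.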

The issue can be easily fixed by using the copy() function:

library(data.table)

# creating a data.frame
data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1

# copying data to data_temp explicitly
data_temp <- copy(data)

# assigning some random value to data_temp so that it should no longer be a
# copy of "data" - if the copy() function isn't used, that just unlinks the 
# "random_value" column, but not the others
data_temp[1, "random_value"] <- rnorm(1)

# converting data_temp to data.table
setDT(data_temp)

# expanding trim parameter to group and subsetting
data_temp <- data_temp[, trim := sum(trim), by = sequence][trim == 0]

So it's necessary to use the copy() function whenever you don't want data.table modifications by reference on the copied table to affect the original table (or vice versa), even if the objects are not (yet) data.table objects at the time you copy them.
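To confirm that the original really stays untouched, one can compare the column before and after the grouped update. This is just a sketch of such a check. The names trim_before, unchanged and remaining are made up for illustration, and the random_value step is left out because it does not matter once copy() is used:

```r
library(data.table)

data <- data.frame(sequence = rep(c("A","B","C","D"), c(2,3,3,2)), trim = 0, random_value = NA)
data[c(1:4, 10), "trim"] <- 1

# keep an independent snapshot of the original column for comparison
trim_before <- copy(data$trim)

# explicit deep copy, then convert and modify by reference
data_temp <- copy(data)
setDT(data_temp)
data_temp <- data_temp[, trim := sum(trim), by = sequence][trim == 0]

# the original data.frame should not have been touched
unchanged <- identical(data$trim, trim_before)

# only the "C" sequence has a group sum of 0
remaining <- unique(as.character(data_temp$sequence))

print(unchanged)
print(remaining)
```

The snapshot is taken with copy() as well, since a plain trim_before <- data$trim would point to the same vector as the column itself.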
