
Do I need to use copy() inside a function to avoid undesired modification of the input data.table?

For example

myfun <- function(mydata) {   
     mydata[,newcolumn := .N,by=id]   
     setnames(mydata, "newcolumn", "Count")
     return(table(mydata$Count))
}

or

myfun <- function(mydata) {   
     temp <- copy(mydata)
     temp[,newcolumn := .N,by=id]   
     setnames(temp, "newcolumn", "Count")
     return(table(temp$Count))
}

Or does passing the data.table to the function already create a copy, even if I assign with :=?

  • Maybe related: Understanding exactly when a data.table is a reference to (vs a copy of) another data.table Commented Feb 26, 2018 at 19:04
  • Re the last question, no it does not create a copy on its own. I think they want to export the shallow function eventually that will make this copy less wasteful github.com/Rdatatable/data.table/issues/2323 Also relevant stackoverflow.com/a/45925735 Commented Feb 26, 2018 at 19:04
  • In short, with copy(mydata) the original table stays unaffected. If that's what you want, then it's advised to copy the data table to another. So, in second function, the newColumn gets created in temp while the mydata table remains unaffected. Commented Feb 26, 2018 at 19:07
  • @ManishSaraswat but does the first function affect the original data.table even if it's inside a function? I've been trying and it doesn't seem to, but I'm afraid it could produce unexpected results Commented Feb 26, 2018 at 19:28
  • @skan no, it won't affect in the first function as well. I forgot to notice, since it's inside the function, it won't affect the data.table globally. or does it? Commented Feb 26, 2018 at 19:52

1 Answer

The answer linked by @Henrik (https://stackoverflow.com/a/10226454/4468078) explains all the details needed to answer your question.

This (modified) version of your example function does not modify the passed data.table:

library(data.table)
dt <- data.table(id = 1:4, a = LETTERS[1:4])
myfun2 <- function(mydata) {   
  x <- mydata[, .(newcolumn = .N), by=id]
  setnames(x, "newcolumn", "Count")
  return(table(x$Count))
}
myfun2(dt)

This does not copy the whole data.table (which would waste RAM and CPU time). It only writes the result of the aggregation into a new data.table, which you can modify without side effects (that is, without changing the original data.table).

> str(dt)
Classes ‘data.table’ and 'data.frame':  4 obs. of  2 variables:
 $ id: int  1 2 3 4
 $ a : chr  "A" "B" "C" "D"

A data.table is always passed by reference to a function, so you have to be careful not to modify it unless you are absolutely sure you want to. In your first function, := and setnames() therefore change the caller's data.table as well; the version with copy() does not.

The data.table package was designed for exactly this: it modifies data in place instead of following R's usual "COW" ("copy on (first) write") principle, which is what makes its data manipulation efficient.

"Dangerous" operations that modify a data.table are mainly:

  • := assignment to modify or create a new column "in-place"
  • all set* functions

If you don't want to modify a data.table, use only row filters and column (selection) expressions (the i, j, by etc. arguments) without :=.

Chaining also prevents modification of the original data.table if the "by ref" modification happens in the second (or later) link of the chain, because the first link has already returned a new data.table:

myfun3 <- function(mydata) {
  # the row filter returns a new data.table, so := only changes that result
  return(mydata[id < 3,][, a := "not overwritten outside"])
}

myfun3(dt)
# > str(dt)
# Classes ‘data.table’ and 'data.frame':    4 obs. of  2 variables:
# $ id: int  1 2 3 4
# $ a : chr  "A" "B" "C" "D"
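
The protection comes from the row filter in the first link, not from chaining as such. A small sketch of the contrast, with hypothetical function names (subset_then_assign, assign_in_same_call) and sample data chosen for this illustration: when the filter and := sit in the same [ ] call, the original table is modified.

```r
library(data.table)

# Hypothetical helper: filter first, then assign in a second link of the chain.
subset_then_assign <- function(mydata) {
  # The row filter returns a new data.table; := then changes only that new table.
  mydata[id < 3, ][, a := "changed in result only"]
}

# Hypothetical helper: filter and assign in the same call.
assign_in_same_call <- function(mydata) {
  # Sub-assignment by reference: the matching rows of the caller's table change.
  mydata[id < 3, a := "overwritten outside"]
  invisible(mydata)
}

dt1 <- data.table(id = 1:4, a = LETTERS[1:4])
res <- subset_then_assign(dt1)
a_after_chain <- dt1$a        # column a of the original after the chained version

dt2 <- data.table(id = 1:4, a = LETTERS[1:4])
assign_in_same_call(dt2)
a_after_same_call <- dt2$a    # column a of the original after the single-call version

print(a_after_chain)
print(a_after_same_call)
```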