6

I need to programmatically apply different functions to different columns and group by, using data.table.

If the columns and functions were known, I would do like this:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3))
DT
#>    id v1 v2
#> 1:  a  2  5
#> 2:  a  2  5
#> 3:  a  2  5
#> 4:  b  3 10
#> 5:  b  3 10
#> 6:  b  3 10
#> 7:  c  4 15
#> 8:  c  4 15
#> 9:  c  4 15
DT[, .(v1=mean(v1), v2=sum(v2)), keyby=.(id)]
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

But I want to do this by passing the column names and their specific function:

aggregate_functions = list(v1=mean, v2=sum)
col_selection = c('v1', 'v2')

I wrote something like this by I can't figure out a way of passing the column name to lapply:

DT[, lapply(.SD, 
            aggregate_functions[[col_name]] # some way of selecting the right function from aggregate_functions
            ), 
   .SDcols = col_selection, 
   by=id]

I have also tried with meltand dcast, but the latter applies all the functions to all the columns:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3))
DTm = melt(DT, meaure.vars=col_selection, id.vars='id')
DTm
#>     id variable value
#>  1:  a       v1     2
#>  2:  a       v1     2
#>  3:  a       v1     2
#>  4:  b       v1     3
#>  5:  b       v1     3
#>  6:  b       v1     3
#>  7:  c       v1     4
#>  8:  c       v1     4
#>  9:  c       v1     4
#> 10:  a       v2     5
#> 11:  a       v2     5
#> 12:  a       v2     5
#> 13:  b       v2    10
#> 14:  b       v2    10
#> 15:  b       v2    10
#> 16:  c       v2    15
#> 17:  c       v2    15
#> 18:  c       v2    15
DTc = dcast(DTm, id~variable, fun.aggregate=list(sum, mean))
DTc
#>    id value_sum_v1 value_sum_v2 value_mean_v1 value_mean_v2
#> 1:  a            6           15             2             5
#> 2:  b            9           30             3            10
#> 3:  c           12           45             4            15

I could programmatically select and rename the relevant columns (3 and 4 in this case) but it doesn't look like an efficient approach.

Of course I could have a loop doing the job and merging the results, but I am looking for a data.table way.

Thank you for your answer and thank you to the team at data.table.

Created on 2019-11-26 by the reprex package (v0.3.0)

0

2 Answers 2

1

An option is to use mapply:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
    .SDcols=col_selection]

Need to careful on the ordering of col_selection and aggregate_functions so that the right function is applied to the right column.

output:

   id v1 v2
1:  a  2 15
2:  b  3 30
3:  c  4 45

Edit from the OP:

Just to complete this brilliant solution. This solution works very well and if we replace col_selection with names(aggregate_functions) there is no issue with the ordering. Plus it automatically discards all the columns that are not in the list:

library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3),
                id2 = c(rep(c('cc', 'dd'), 4), 'dd')
                )
aggregate_functions = list(v1=mean, v2=sum)
DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), id, 
   .SDcols=names(aggregate_functions)]
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

It is also possible to use multiple variables to aggregate by, by passing a list:

DT[, mapply(function(f,x) as.list(f(x)), aggregate_functions, .SD), list(id, id2), 
   .SDcols=names(aggregate_functions)]
#>    id id2 v1 v2
#> 1:  a  cc  2 10
#> 2:  a  dd  2  5
#> 3:  b  dd  3 20
#> 4:  b  cc  3 10
#> 5:  c  cc  4 15
#> 6:  c  dd  4 30

Created on 2019-11-27 by the reprex package (v0.3.0)

Sign up to request clarification or add additional context in comments.

1 Comment

This is great! I have added a few things to the original solution, you can remove them if you don't like it. Thank you
1

After I posted the question, a link to this answer by @Uwe appeared in the right box that holds the results I am looking for. I tweaked it to match my example:

library(magrittr)
library(data.table)
DT = data.table(id = rep(letters[1:3], each=3),
                v1 = rep(c(2, 3, 4), each=3),
                v2 = rep(c(5, 10, 15), each=3))
aggregate_functions = list(v1='mean', v2='sum')
col_selection = c('v1', 'v2')
aggregate_functions %>%
  names() %>% 
  lapply(
    function(col_selection) lapply(
      aggregate_functions[[col_selection]], 
      function(.fct) sprintf("%s = %s(%s)", col_selection, .fct, col_selection))) %>% 
  unlist() %>% 
  paste(collapse = ", ") %>% 
  sprintf("DT[, .(%s), by = id]", .) %>% 
  parse(text = .) %>% 
  eval()
#>    id v1 v2
#> 1:  a  2 15
#> 2:  b  3 30
#> 3:  c  4 45

I would still be interested in 'all in data.table' solutions.

Created on 2019-11-26 by the reprex package (v0.3.0)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.