2

[ This is also reported on the multidplyr github page ]

I'm trying to use multidplyr_0.0.0.9000 with dplyr_0.7.4.9000 and pmap_dfr from purrr_0.2.4.9000. The following code (without using multidplyr) works fine:

grid1 = as_tibble(expand.grid(m1 = c(1:10), m2 = c(20:30)))
retstuff = function(m1, m2) { return(tribble(~m3, ~m4, m1+1, m2+2)) }
pmap_dfr(grid1, retstuff)

When I try to partition the grid with multidplyr:

grid2 = partition(grid1, m1)
pmap_dfr(grid2, retstuff)

I get the error Error: Element 5 is not a vector (environment) from pmap_dfr()

I also get the following warning from partition() as also reported on github: group_indices_.grouped_df ignores extra arguments. Not sure if that's related or not.

1 Answer 1

3

A few issues:

  • You need to load any necessary packages (beyond dplyr) on each node,
  • You need to copy your function to each node, and
  • You can only call dplyr verbs on the partitioned data frame, so you need to wrap the pmap_dfr call in dplyr::do

after which it works:

library(tidyverse)
library(multidplyr)

grid1 <- as_tibble(expand.grid(m1 = c(1:10), m2 = c(20:30)))
retstuff <- function(m1, m2) { 
    tribble(   ~m3,    ~m4, 
            m1 + 1, m2 + 2)
}

grid2 <- partition(grid1, m1)
#> Initialising 7 core cluster.
#> Warning: group_indices_.grouped_df ignores extra arguments
cluster_library(grid2, 'tidyverse')    # load packages on each node
cluster_copy(grid2, retstuff)    # copy function to each node

grid2 %>% do(pmap_dfr(., retstuff))    # wrap call in dplyr::do
#> Source: party_df [110 x 3]
#> Groups: m1
#> Shards: 7 [11--22 rows]
#> 
#> # S3: party_df
#>       m1    m3    m4
#>    <int> <dbl> <dbl>
#>  1     9    10    22
#>  2     9    10    23
#>  3     9    10    24
#>  4     9    10    25
#>  5     9    10    26
#>  6     9    10    27
#>  7     9    10    28
#>  8     9    10    29
#>  9     9    10    30
#> 10     9    10    31
#> # ... with 100 more rows

...but for this particular case, while multidplyr is a little faster, plain dplyr::mutate is quite a lot faster yet, and a lot easier to write:

grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2)
#> # A tibble: 110 x 4
#>       m1    m2    m3    m4
#>    <int> <int> <dbl> <dbl>
#>  1     1    20     2    22
#>  2     2    20     3    22
#>  3     3    20     4    22
#>  4     4    20     5    22
#>  5     5    20     6    22
#>  6     6    20     7    22
#>  7     7    20     8    22
#>  8     8    20     9    22
#>  9     9    20    10    22
#> 10    10    20    11    22
#> # ... with 100 more rows

all.equal(grid2 %>% do(pmap_dfr(., retstuff)) %>% collect, 
          grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% select(-m2))
#> [1] TRUE

microbenchmark::microbenchmark(
    multidplyr_pmap = grid2 %>% do(pmap_dfr(., retstuff)) %>% collect(), 
    multidplyr_mutate = grid2 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% collect(),
    pmap = grid1 %>% pmap_dfr(retstuff),
    mutate = grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% select(-m2)
)
#> Unit: milliseconds
#>               expr        min        lq       mean    median         uq       max neval
#>    multidplyr_pmap 113.896646 117.18365 122.656286 119.75652 125.874450 182.53330   100
#>  multidplyr_mutate  12.419918  12.84528  16.271337  13.68441  15.092482 177.77372   100
#>               pmap 372.512544 387.49371 397.844622 394.71971 402.640281 551.78633   100
#>             mutate   7.014426   7.49689   8.499588   7.66554   8.654478  32.22647   100  
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks, that did the trick! Regarding performance, I was simplifying my actual code down to something that still exhibited the error and kept a basic skeleton of the approach. The real grid1 is a lot bigger and the real retstuff() is more complex :-)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.