How to write a function with tidy eval when using the "arrow" R package (arrow::open_dataset) and dplyr verbs?

Question

What I'm trying to do

I'm trying to write a function that uses dplyr verbs and takes an Arrow dataset (opened with arrow::open_dataset) as its first argument and a column of that dataset as its second. Because I need to pass the column name as a string (my real use case is a Shiny app), I'm using the .data[[.column]] syntax. Below is an image of the error I'm getting and some code to reproduce it. Any help or insight is appreciated.

Image of error message

Code to reproduce error

# install.packages(c("dplyr", "ggplot2", "arrow"))
library(dplyr)

arrow::write_parquet(x = ggplot2::mpg, sink = "sample_data.parquet")

dat <- arrow::open_dataset("sample_data.parquet")

glimpse(dat)

get_metric <- function(.data, .metric) {

  .data %>%
    group_by(manufacturer, cyl) %>%
    summarize(
      # this .data[[.metric]] lookup is the part that errors on the arrow dataset
      new_col = sum(.data[[.metric]], na.rm = TRUE)
    ) %>%
    ungroup()
}

get_metric(dat, "cty") %>% collect()

Additional code that works but doesn't use arrow as much so not ideal for speed

In this code I collect before the tidy eval step, so it's essentially regular dplyr code. It runs, but it is slower than equivalent arrow code I had running successfully before I extracted it into a function.

get_metric2 <- function(.data, .metric) {

  .data %>%
    collect() %>%  # pulls the whole dataset into memory before grouping
    group_by(manufacturer, cyl) %>%
    summarize(
      new_col = sum(.data[[.metric]], na.rm = TRUE)
    ) %>%
    ungroup()
}

get_metric2(dat, "cty")

If you're hoping to do this programmatically, is it safe to hard-code manufacturer, cyl in your function?
– r2evans, Commented Feb 23, 2024 at 18:19
Yeah, that's a fair point, thank you. I was just trying to keep the sample code as simple as possible. Do you have any idea of a solution to the actual problem though? Per your suggestion, the second function could be written like this: get_metric2 <- function(.data, .metric, ...) { .data %>% collect() %>% group_by(...) %>% summarize( new_col = sum(.data[[.metric]], na.rm = T) ) %>% ungroup() } get_metric2(dat, "cty", manufacturer, cyl)
– Avery Robbins, Commented Feb 23, 2024 at 18:24

r2evans · Accepted Answer · 2024-02-23 19:18:44Z


Convert the string to a symbol with rlang::sym() and unquote it with the !! operator.

library(dplyr)

arrow::write_parquet(x = ggplot2::mpg, sink = "sample_data.parquet")
dat <- arrow::open_dataset("sample_data.parquet")

get_metric <- function(.data, .metric) {
  # turn the column name (a string) into a symbol
  .metric <- rlang::sym(.metric)
  .data %>%
    group_by(manufacturer, cyl) %>%
    summarize(
      # !! unquotes the symbol so the call refers to the column itself
      new_col = sum(!!.metric, na.rm = TRUE)
    ) %>%
    ungroup()
}

get_metric(dat, "cty") %>%
  collect()
# # A tibble: 32 × 3
#    manufacturer   cyl new_col
#    <chr>        <int>   <int>
#  1 audi             4     153
#  2 audi             6     148
#  3 audi             8      16
#  4 chevrolet        8     191
#  5 chevrolet        4      41
#  6 chevrolet        6      53
#  7 dodge            4      18
#  8 dodge            6     225
#  9 dodge            8     243
# 10 ford             8     197
# # ℹ 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
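
Following up on the comment about hard-coded grouping columns, here is a minimal sketch of the same idea extended so that the grouping columns are also passed as strings. The helper name get_metric_by and its .group_cols argument are hypothetical, not part of the original answer. The sketch is run on a small in-memory data frame; I have not verified it against an arrow dataset, though it relies only on rlang::sym(), rlang::syms(), !! and !!!.

```r
library(dplyr)

# Hypothetical helper: the metric and the grouping columns all arrive as strings.
get_metric_by <- function(.data, .metric, .group_cols) {
  metric_sym <- rlang::sym(.metric)       # one string  -> one symbol
  group_syms <- rlang::syms(.group_cols)  # many strings -> list of symbols
  .data %>%
    group_by(!!!group_syms) %>%           # !!! splices the list into group_by()
    summarize(new_col = sum(!!metric_sym, na.rm = TRUE)) %>%
    ungroup()
}

# Toy data standing in for the parquet-backed dataset (assumption for illustration).
toy <- data.frame(
  manufacturer = c("audi", "audi", "ford", "ford"),
  cyl = c(4L, 4L, 8L, 8L),
  cty = c(18L, 21L, 11L, 13L)
)

result <- get_metric_by(toy, "cty", c("manufacturer", "cyl"))
print(result)
```

With an arrow dataset the call would presumably be followed by collect(), as in the answer above.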

edited Feb 23, 2024 at 19:18

answered Feb 23, 2024 at 18:25


7 Comments

Avery Robbins Over a year ago

Thank you for this answer. "cty" does exist btw. It's right there in your names(dat). Any ideas on why the "more new" syntax of .data[[.metric]] doesn't work, but the older !! does?

r2evans Over a year ago

The translation of lazy expressions into Arrow-friendly code has not kept pace with, or been fully extended to, the lazy-data variants that dplyr has introduced. I suspect that rlang's !! is resolved before the expression is handed to arrow, which makes it a lot more extensible and flexible in that regard.
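
A small sketch of what that suspicion means on the rlang side, using rlang::expr() to capture a call without evaluating it. It only shows what rlang produces; how arrow then translates each call is not demonstrated here, and the variable names are illustrative.

```r
library(rlang)

metric <- "cty"
metric_sym <- sym(metric)

# With !!, the symbol is substituted while the call is being captured,
# so the captured call mentions the column name directly.
unquoted <- expr(sum(!!metric_sym, na.rm = TRUE))
print(unquoted)

# With the .data pronoun, the lookup is still part of the captured call
# and has to be understood by whatever evaluates or translates it later.
pronoun <- expr(sum(.data[[metric]], na.rm = TRUE))
print(pronoun)
```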

Avery Robbins Over a year ago

Not at all, you helped me make the code work properly. Pretty funny though.

Avery Robbins Over a year ago

Your explanation makes sense. Thanks again r2evans.

LMc Over a year ago

r2evans, Yeah, I'm a dummy -- what a ridiculous thing to say about r2evans :)
