What I'm trying to do

I'm trying to write a function that uses dplyr verbs and takes an Arrow dataset (opened with arrow::open_dataset()) as its first argument and the name of a column in that dataset as its second. Because I need to pass the column as a string (my real use case is a Shiny app, where inputs arrive as strings), I'm using the .data[[.metric]] syntax. Below is an image of the error I'm getting and some code to reproduce it. Any help or insight is appreciated.

Image of error message

[Screenshot of the error message, not reproduced here]

Code to reproduce error

# install.packages(c("dplyr", "ggplot2", "arrow"))
library(dplyr)

arrow::write_parquet(x = ggplot2::mpg, sink = "sample_data.parquet")

dat <- arrow::open_dataset("sample_data.parquet")

glimpse(dat)

get_metric <- function(.data, .metric) {
  
  .data %>%
    group_by(manufacturer, cyl) %>% 
    summarize(
      new_col = sum(.data[[.metric]], na.rm = TRUE)
    ) %>% 
    ungroup() 
}

get_metric(dat, "cty") %>% collect()

Additional code that works but makes less use of arrow, so it is not ideal for speed

In this code I collect() before the tidy eval step, so it's essentially regular dplyr code running on an in-memory data frame. It runs, but it is slower than code that I had running successfully before I extracted the logic into a function.

get_metric2 <- function(.data, .metric) {
  
  .data %>%
    collect() %>% 
    group_by(manufacturer, cyl) %>% 
    summarize(
      new_col = sum(.data[[.metric]], na.rm = TRUE)
    ) %>% 
    ungroup() 
}

get_metric2(dat, "cty")
  • If you're hoping to do this programmatically, is it safe to hard-code manufacturer, cyl in your function? Commented Feb 23, 2024 at 18:19
  • Yeah, that's a fair point, thank you. I was just trying to keep the sample code as simple as possible. Do you have any idea of a solution to the actual problem, though? Per your suggestion, the second function could be written like this: get_metric2 <- function(.data, .metric, ...) { .data %>% collect() %>% group_by(...) %>% summarize( new_col = sum(.data[[.metric]], na.rm = T) ) %>% ungroup() } get_metric2(dat, "cty", manufacturer, cyl) Commented Feb 23, 2024 at 18:24

1 Answer

Convert the column name to a symbol with rlang::sym() and unquote it with the !! (bang-bang) operator.

library(dplyr)

arrow::write_parquet(x = ggplot2::mpg, sink = "sample_data.parquet")
dat <- arrow::open_dataset("sample_data.parquet")

get_metric <- function(.data, .metric) {
  .metric <- rlang::sym(.metric)
  .data %>%
    group_by(manufacturer, cyl) %>% 
    summarize(
      new_col = sum(!!.metric, na.rm = TRUE)
    ) %>% 
    ungroup() 
}

get_metric(dat, "cty") %>%
  collect()
# # A tibble: 32 × 3
#    manufacturer   cyl new_col
#    <chr>        <int>   <int>
#  1 audi             4     153
#  2 audi             6     148
#  3 audi             8      16
#  4 chevrolet        8     191
#  5 chevrolet        4      41
#  6 chevrolet        6      53
#  7 dodge            4      18
#  8 dodge            6     225
#  9 dodge            8     243
# 10 ford             8     197
# # ℹ 22 more rows
# # ℹ Use `print(n = ...)` to see more rows
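
Since the column name arrives as a string (for example from a Shiny input), the grouping columns raised in the comments on the question can be handled in the same way. The following is a minimal sketch, not part of the original answer: the function name get_metric_by and its .groups argument are hypothetical, the data is written to a temporary file rather than the working directory, and it assumes that !!! with rlang::syms() is resolved before the expression reaches arrow, just as !! is.

```r
library(dplyr)

# Assumption: write the sample data to a temporary file
path <- tempfile(fileext = ".parquet")
arrow::write_parquet(x = ggplot2::mpg, sink = path)
dat <- arrow::open_dataset(path)

# Hypothetical variant: the metric and the grouping columns are all strings
get_metric_by <- function(.data, .metric, .groups) {
  .metric <- rlang::sym(.metric)    # "cty" becomes the symbol cty
  .groups <- rlang::syms(.groups)   # character vector becomes a list of symbols
  .data %>%
    group_by(!!!.groups) %>%        # splice the symbols in as separate arguments
    summarize(new_col = sum(!!.metric, na.rm = TRUE)) %>%
    ungroup()
}

res <- get_metric_by(dat, "cty", c("manufacturer", "cyl")) %>%
  collect()
```

The work still happens in arrow until collect() is called, so this keeps the speed benefit the question was after.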

7 Comments

Thank you for this answer. "cty" does exist, by the way: it's right there in your names(dat). Any idea why the newer .data[[.metric]] syntax doesn't work but the older !! does?
Arrow's translation of lazy dplyr expressions into Arrow-friendly code has not kept pace with, or been fully extended to, all of the tidy-eval features that dplyr has introduced. I suspect that rlang's !! is resolved before the expression is handed to arrow, which makes it a lot more flexible in that regard.
Not at all, you helped me make the code work properly. Pretty funny though.
Your explanation makes sense. Thanks again r2evans.
r2evans, Yeah, I'm a dummy -- what a ridiculous thing to say about r2evans :)
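
The suspicion voiced in the comments, that !! is resolved before the expression is handed to arrow, can be illustrated with rlang alone. This is only a sketch of what gets captured, not a statement about arrow's internals: quo() stands in here for the argument capturing that a dplyr backend performs.

```r
library(rlang)

metric <- sym("cty")

# With !!, the symbol is injected at capture time
with_bang <- quo(sum(!!metric, na.rm = TRUE))
quo_get_expr(with_bang)

# With the .data pronoun, the captured expression still mentions
# .data and .metric, so whatever evaluates it must interpret them
with_pronoun <- quo(sum(.data[[.metric]], na.rm = TRUE))
quo_get_expr(with_pronoun)
```

The first captured expression refers only to the column cty, while the second still contains .data and .metric and leaves their interpretation to the backend.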