How can I pass column names through nested functions using `dplyr` in R?

Question

I am trying to write a function that, using tidyverse tools, will load two CSVs, join them on some key column, and write the joined result to a CSV. I am passing in the names of the key columns as character, and using double braces ({{ ... }}) to make the column name usable in the inner_join.

I have a function that works on the data frames. But I'm writing another function that takes filenames, loads the CSVs into data frames, and calls function 1. The "outer" function also assumes that the first column in the CSV is the key column, and passes that column name to the inner function.

The problem is that unless the column name is hard-coded, the code breaks. It appears that the outer function is passing in the argument unevaluated. If I read the column name into a variable, I get an error message like this:

Error in `inner_join()`:
! Join columns in `x` must be present in the data.
✖ Problem with `df1_key`.
Run `rlang::last_trace()` to see where the error occurred.

If I use colnames(df_loaded_from_csv)[1] as the function argument, I get an error message like this:

Error in `join_by()`:
! Expressions can't contain computed columns, and can only reference columns by name or by explicitly specifying
  a side, like `x$col` or `y$col`.
ℹ Expression 1 contains `colnames(df1)[1]`.
Run `rlang::last_trace()` to see where the error occurred.

In each case, it appears that the variable or the expression is not being evaluated, and is ending up in as an argument to the inner_join in the inner function. I can't even begin to understand how this is even possible.

Here is a small reproducible example:

library(tidyverse)

# Create example CSVs
df1 = tibble(
  key = LETTERS,
  value = sample.int(100, 26)
)
write_csv(df1, "df1.csv")

df2 = tibble(
  key = LETTERS[1:13],
  value = sample.int(100, 13)
)
write_csv(df2, "df2.csv")

# Function to join two data frames and return the joined data frame
join_dfs = function(x, y, x_key = "key", y_key = "key") {
  
  df = x |>
    inner_join(y, by = join_by({{ x_key }} == {{ y_key }} ))
  
}

# This works:
df3 = join_dfs(df1, df2, "key", "key")

# Function to load to CSVs and call join_dfs on the loaded data frames
join_csvs = function(x, y, file) {
  df1 = read_csv(x)
  df1_key = colnames(df1)[1]
  print(df1_key)
  
  df2 = read_csv(y)
  df2_key = colnames(df2)[1]
  print(df2_key)
  
  df3 = join_dfs(df1, df2, df1_key, df2_key)
  write_csv(df3, file)
  
}

# This fails:
join_csvs("df1.csv", "df2.csv", "df3.csv")

How can I get this to work?

Edward · Accepted Answer · 2025-04-11 07:09:59Z

4

Try !! (bang bang) instead. I'm not sure why {} (curly curly) doesn't work using the second method.

join_dfs = function(x, y, x_key = "key", y_key = "key") {
  df <- x |>
    inner_join(y, by = join_by(!!x_key == !!y_key))
}

As an aside, it's recommended that a function returns something (usually placed on the last line but that's not strict) instead of assigning the final result to an object that will never be used (df).

join_dfs = function(x, y, x_key = "key", y_key = "key") {
  df <- x |>
    inner_join(y, by = join_by(!!x_key == !!y_key))
  df
}

Without the returning object on the last line, this command

join_dfs(df1, df2, "key", "key")

doesn't print anything. See this post for more details on that.

And I would simplify the function by just doing:

join_dfs = function(x, y, x_key = "key", y_key = "key") {
  x |>
    inner_join(y, by = join_by(!!x_key == !!y_key))
}

edited Apr 11 at 7:09

answered Apr 11 at 4:56

Edward

22.2k3 gold badges18 silver badges37 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Lee Hachadoorian Apr 11 at 20:12

"As an aside, it's recommended that a function returns something (usually placed on the last line but that's not strict) instead of assigning the final result to an object that will never be used (df)." Interesting. That was an oversight on my part, but the function works as written. df still get returned and assigned to df3.

Edward Apr 12 at 3:49

Yes, but you don’t need to assign the result to ‘df’ inside the function.

Lee Hachadoorian Apr 12 at 15:46

Point taken, and I've updated my code to match yours. I was just pointing out that even with the assignment, the result of the operation was returned by the function.

KacZdr · Accepted Answer · 2025-04-11 08:24:44Z

3

The other option is to change the inner function to accept strings directly.

Assuming that you always plan to pass the key column names as character strings, we can use setNames :

join_dfs <- function(x, y, x_key = "key", y_key = "key") {
  x %>% 
    inner_join(y, by = setNames(y_key, x_key))
}

answered Apr 11 at 8:24

KacZdr

1,4513 gold badges13 silver badges28 bronze badges

1 Comment

Lee Hachadoorian Apr 11 at 20:51

This works also, and I have upvoted, but selected the other answer as the correct answer.

Collectives™ on Stack Overflow

How can I pass column names through nested functions using `dplyr` in R?

2 Answers 2

3 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

3 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related