3

I am trying to write a function that, using tidyverse tools, will load two CSVs, join them on some key column, and write the joined result to a CSV. I am passing in the names of the key columns as character, and using double braces ({{ ... }}) to make the column name usable in the inner_join.

I have a function that works on the data frames. But I'm writing another function that takes filenames, loads the CSVs into data frames, and calls function 1. The "outer" function also assumes that the first column in the CSV is the key column, and passes that column name to the inner function.

The problem is that unless the column name is hard-coded, the code breaks. It appears that the outer function is passing in the argument unevaluated. If I read the column name into a variable, I get an error message like this:

Error in `inner_join()`:
! Join columns in `x` must be present in the data.
✖ Problem with `df1_key`.
Run `rlang::last_trace()` to see where the error occurred.

If I use colnames(df_loaded_from_csv)[1] as the function argument, I get an error message like this:

Error in `join_by()`:
! Expressions can't contain computed columns, and can only reference columns by name or by explicitly specifying
  a side, like `x$col` or `y$col`.
ℹ Expression 1 contains `colnames(df1)[1]`.
Run `rlang::last_trace()` to see where the error occurred.

In each case, it appears that the variable or the expression is not being evaluated, and is ending up in as an argument to the inner_join in the inner function. I can't even begin to understand how this is even possible.

Here is a small reproducible example:

library(tidyverse)

# Create example CSVs
df1 = tibble(
  key = LETTERS,
  value = sample.int(100, 26)
)
write_csv(df1, "df1.csv")

df2 = tibble(
  key = LETTERS[1:13],
  value = sample.int(100, 13)
)
write_csv(df2, "df2.csv")

# Function to join two data frames and return the joined data frame
join_dfs = function(x, y, x_key = "key", y_key = "key") {
  
  df = x |>
    inner_join(y, by = join_by({{ x_key }} == {{ y_key }} ))
  
}

# This works:
df3 = join_dfs(df1, df2, "key", "key")

# Function to load to CSVs and call join_dfs on the loaded data frames
join_csvs = function(x, y, file) {
  df1 = read_csv(x)
  df1_key = colnames(df1)[1]
  print(df1_key)
  
  df2 = read_csv(y)
  df2_key = colnames(df2)[1]
  print(df2_key)
  
  df3 = join_dfs(df1, df2, df1_key, df2_key)
  write_csv(df3, file)
  
}

# This fails:
join_csvs("df1.csv", "df2.csv", "df3.csv")

How can I get this to work?

2 Answers 2

4

Try !! (bang bang) instead. I'm not sure why {} (curly curly) doesn't work using the second method.

join_dfs = function(x, y, x_key = "key", y_key = "key") {
  df <- x |>
    inner_join(y, by = join_by(!!x_key == !!y_key))
}

As an aside, it's recommended that a function returns something (usually placed on the last line but that's not strict) instead of assigning the final result to an object that will never be used (df).

join_dfs = function(x, y, x_key = "key", y_key = "key") {
  df <- x |>
    inner_join(y, by = join_by(!!x_key == !!y_key))
  df
}

Without the returning object on the last line, this command

join_dfs(df1, df2, "key", "key")

doesn't print anything. See this post for more details on that.

And I would simplify the function by just doing:

join_dfs = function(x, y, x_key = "key", y_key = "key") {
  x |>
    inner_join(y, by = join_by(!!x_key == !!y_key))
}
Sign up to request clarification or add additional context in comments.

3 Comments

"As an aside, it's recommended that a function returns something (usually placed on the last line but that's not strict) instead of assigning the final result to an object that will never be used (df)." Interesting. That was an oversight on my part, but the function works as written. df still get returned and assigned to df3.
Yes, but you don’t need to assign the result to ‘df’ inside the function.
Point taken, and I've updated my code to match yours. I was just pointing out that even with the assignment, the result of the operation was returned by the function.
3

The other option is to change the inner function to accept strings directly.

Assuming that you always plan to pass the key column names as character strings, we can use setNames :

join_dfs <- function(x, y, x_key = "key", y_key = "key") {
  x %>% 
    inner_join(y, by = setNames(y_key, x_key))
}

1 Comment

This works also, and I have upvoted, but selected the other answer as the correct answer.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.