I am trying to use regex in R to replace forbidden characters. I know
base::make.names can do some of this, but I want to control the replacement
character. I have worked out how to do this for the characters that Windows
forbids in file names, as long as \ is doubled so that gsub sees an escaped,
literal backslash inside the character class:
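library(magrittr)  # provides %>%

# Characters that Windows does not allow in file names
wdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*"
) %>%
  magrittr::set_names(., .)

# Double the backslash so it is a literal inside the character class
wdc_regex <- wdc %>%
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")

wdc %>%
  c(letters[1:5]) %>%
  gsub(wdc_regex, "_", ., perl = TRUE)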
< > : " / \\ | ? *
"_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e"
# Without doubling the backslash, \ is not replaced
wdc_regex <- wdc %>%
  c("[", ., "]") %>%
  paste0(collapse = "")

wdc %>%
  c(letters[1:5]) %>%
  gsub(wdc_regex, "_", ., perl = TRUE)
< > : " / \\ | ? *
"_" "_" "_" "_" "_" "\\" "_" "_" "_" "a" "b" "c" "d" "e"
However, when I use the same strategy for characters that are not allowed in syntactically valid R names, I run
into a number of issues I don't understand.
- No modification needed to replace \: The code for Windows characters requires the call to sub("\\\\", "\\\\\\\\", .) in order to replace \ with _. However, the code below works without this step. Why is it not necessary to expand \ in the code below?
rdc_test <- c(
"<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`",
"!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
, "[", "]"
) %>%
magrittr::set_names(., .)
rdc <- c(
"<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`",
"!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
# , "[", "]"
)
# With the backslash doubled
rdc_regex <- rdc %>%
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>%
c(letters[1:5]) %>%
gsub(rdc_regex, "_", ., perl = TRUE)
< > : " / \\ | ? * ~ , ; + - ` ! @ # $ % ^ & = ( ) ' { } [ ]
"_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e"
# Without doubling the backslash: the result is the same
rdc_regex <- rdc %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>%
c(letters[1:5]) %>%
gsub(rdc_regex, "_", ., perl = TRUE)
< > : " / \\ | ? * ~ , ; + - ` ! @ # $ % ^ & = ( ) ' { } [ ]
"_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e"
- Square brackets [ and ] replaced even when not included in the regex: In the code above, the square bracket characters [ and ] are not included in the character set defined in rdc_regex, except as delimiters of the character set (see "Regex match any single character (one character only)"). However, the square brackets are still replaced. How is this happening?
rdc_regex
[1] "[<>:\"/\\|?*~,;+-`!@#$%^&=()'{}]"
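One way to investigate a pattern like this is to test suspicious pieces of it in isolation. The sketch below checks only the three characters +-` as a character class of their own. The probe vector is an arbitrary choice for illustration, not part of the original code:

```r
# Code points of the two characters on either side of the hyphen
utf8ToInt("+")   # 43
utf8ToInt("`")   # 96

# Characters to test against the isolated sub-pattern
probe <- c("[", "]", "\\", "A", "5", "_", "a", "{")

# Inside [...] an unescaped hyphen between two characters defines a range
in_range <- grepl("[+-`]", probe, perl = TRUE)
print(setNames(in_range, probe))
# The first six probes match; "a" (97) and "{" (123) do not
```

Everything with a code point from 43 to 96 matches, which covers the digits, the uppercase letters, the backslash and both square brackets.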
Solution
Based on the comment from @Wiktor Stribiżew, the issue was that an unescaped - was acting as a range operator in my character class. Thus, +-` (plus, hyphen, backtick) matched every character between + (0x2B) and backtick (0x60) in code-point order, a range that includes \ (0x5C), [ (0x5B), and ] (0x5D). That is why \ was replaced without being doubled, and why the square brackets were replaced although they were never listed. This reference (https://perldoc.perl.org/perlrequick) says the special characters inside a character class are -]\^$, so I've escaped all of them in the code below. I'm not sure if this is overkill, but it is currently working.
rdc_test <- c(
"<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`",
"!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
, "[", "]"
) %>%
magrittr::set_names(., .)
rdc <- c(
"<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "\\\\-", "`",
"!", "@", "#", "\\\\$", "%", "\\\\^", "&", "=", "(", ")", "'", "{", "}"
, "[", "\\\\]"
)
rdc_regex <- rdc %>%
sub("\\\\", "\\\\\\\\", .) %>%
c("[", ., "]") %>%
paste0(collapse = "")
rdc_test %>%
c(letters[1:5]) %>%
gsub(rdc_regex, "_", ., perl = TRUE)
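The escaping above is done by hand in the vector itself. As an alternative sketch, the same idea can be wrapped in a small helper that escapes the special characters programmatically. make_char_class is a hypothetical name, not a function from base R or magrittr, and the set of characters it escapes (\ ] [ ^ -) is my assumption about what is sufficient for PCRE:

```r
# Hypothetical helper: build a PCRE character class from literal characters,
# escaping the ones that have a special meaning inside [...]
make_char_class <- function(chars) {
  # Put a backslash in front of each of \ ] [ ^ -
  escaped <- gsub("([\\\\\\]\\[\\^-])", "\\\\\\1", chars, perl = TRUE)
  paste0("[", paste0(escaped, collapse = ""), "]")
}

rdc_chars <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`",
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}", "[", "]"
)

rdc_class <- make_char_class(rdc_chars)
cat(rdc_class, "\n")  # shows the pattern as the regex engine receives it

# Forbidden characters become "_"; letters and digits are left alone
cleaned <- gsub(rdc_class, "_", c(rdc_chars, "a", "B", "7"), perl = TRUE)
print(cleaned)
```

Because the hyphen is escaped, no range is formed, so uppercase letters and digits are no longer replaced by accident.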
Comment: the - in your character class is unescaped. It must be "[<>:\"/\\\\|?*~,;+\\-`!@#$%^&=()'{}]"