
I am trying to use regex in R to replace forbidden characters. I know base::make.names can do some of this, but I want to control the replacement character. I have successfully figured out how to do this for Windows file name forbidden characters, as long as \ is expanded to accommodate the needs of gsub:

library(magrittr)  # provides %>% and set_names()

wdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*"
) %>% 
  magrittr::set_names(., .)
wdc_regex <- wdc %>% 
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
wdc %>% 
  c(letters[1:5]) %>%
  gsub(wdc_regex, "_", ., perl = TRUE)
  <   >   :   "   /  \\   |   ?   *                     
"_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e" 
# Same pipeline without the sub() step: the backslash is no longer replaced
wdc_regex <- wdc %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
wdc %>% 
  c(letters[1:5]) %>%
  gsub(wdc_regex, "_", ., perl = TRUE)
   <    >    :    "    /   \\    |    ?    *                          
 "_"  "_"  "_"  "_"  "_" "\\"  "_"  "_"  "_"  "a"  "b"  "c"  "d"  "e" 

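As a minimal sketch of why the sub() step matters for the Windows set, the two patterns can be printed as the regex engine receives them. The names with_escape and without_escape are illustrative only; they are assumed to hold the same strings that the two pipelines above build.

```r
# Pattern strings as the two pipelines above should produce them
with_escape    <- "[<>:\"/\\\\|?*]"  # backslash doubled by the sub() step
without_escape <- "[<>:\"/\\|?*]"    # sub() step skipped

# writeLines() shows the characters the regex engine actually sees
writeLines(c(with_escape, without_escape))
# [<>:"/\\|?*]
# [<>:"/\|?*]

# In the first pattern, \\ is an escaped (literal) backslash.
# In the second, \| is just an escaped |, so no backslash is in the set.
has_backslash_1 <- grepl(with_escape, "\\", perl = TRUE)     # TRUE
has_backslash_2 <- grepl(without_escape, "\\", perl = TRUE)  # FALSE
```

This matches the two outputs above, where only the first pattern replaces the backslash.
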
However, when I use the same strategy for characters that don't work with syntactically valid names in R, I run into a number of issues I don't understand.

  1. No modification needed to replace \: The code for Windows characters requires the call to sub("\\\\", "\\\\\\\\", .) in order to replace \ with _. However, the code below works without this step. Why is it not necessary to expand \ in the code below?
rdc_test <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`", 
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
  , "[", "]"
) %>% 
  magrittr::set_names(., .)
rdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`", 
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
  # , "[", "]"
)
rdc_regex <- rdc %>% 
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>% 
  c(letters[1:5]) %>%
  gsub(rdc_regex, "_", ., perl = TRUE)
  <   >   :   "   /  \\   |   ?   *   ~   ,   ;   +   -   `   !   @   #   $   %   ^   &   =   (   )   '   {   }   [   ]                     
"_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e" 
# Same pipeline without the sub() step: the output is unchanged
rdc_regex <- rdc %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>% 
  c(letters[1:5]) %>%
  gsub(rdc_regex, "_", ., perl = TRUE)
  <   >   :   "   /  \\   |   ?   *   ~   ,   ;   +   -   `   !   @   #   $   %   ^   &   =   (   )   '   {   }   [   ]                     
"_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "_" "a" "b" "c" "d" "e" 
  2. Square brackets [ and ] replaced even when not included in regex: In the code above, the square brackets [ and ] are not part of the character set defined in rdc_regex; they appear only as the delimiters of the character class (see "Regex match any single character (one character only)"). However, the square brackets are still replaced. How is this happening?
rdc_regex
[1] "[<>:\"/\\|?*~,;+-`!@#$%^&=()'{}]"

Solution: Based on the comment from @Wiktor Stribiżew, the issue was that an unescaped - was acting as a range operator in my character class. Thus, +-` matched every character between + and ` in code-point order. That range includes \, [, and ], which explains both observations: the backslash was replaced without the sub() step, and the square brackets were replaced without being listed. This reference (https://perldoc.perl.org/perlrequick) says the special characters inside a character class are -]\^$, so I've escaped all of them in the code below. I'm not sure if this is overkill, but it is currently working.

rdc_test <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "-", "`", 
  "!", "@", "#", "$", "%", "^", "&", "=", "(", ")", "'", "{", "}"
  , "[", "]"
) %>% 
  magrittr::set_names(., .)
rdc <- c(
  "<", ">", ":", '"', "/", "\\", "|", "?", "*", "~", ",", ";", "+", "\\\\-", "`", 
  "!", "@", "#", "\\\\$", "%", "\\\\^", "&", "=", "(", ")", "'", "{", "}"
  , "[", "\\\\]"
)
rdc_regex <- rdc %>% 
  sub("\\\\", "\\\\\\\\", .) %>%
  c("[", ., "]") %>%
  paste0(collapse = "")
rdc_test %>% 
  c(letters[1:5]) %>%
  gsub(rdc_regex, "_", ., perl = TRUE)
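
The range effect described above can be checked directly. The following is a minimal sketch, assuming ASCII input; the names lo, hi, probe, and safe_regex are hypothetical and not part of the original code. It also shows an alternative to escaping: placing - last in the class, where it is taken literally.

```r
# Code points of the two endpoints of the accidental range +-`
lo <- utf8ToInt("+")  # 43
hi <- utf8ToInt("`")  # 96

# Which of these characters fall inside that range?
probe <- c("\\", "[", "]", "a")
in_range <- vapply(probe, function(ch) {
  cp <- utf8ToInt(ch)
  cp >= lo && cp <= hi
}, logical(1))
print(in_range)  # TRUE for \ [ and ], FALSE for a

# Same set as rdc, with "-" moved to the end so it cannot form a range.
# Only the backslash still needs escaping for the regex engine.
safe_regex <- "[<>:\"/\\\\|?*~,;+`!@#$%^&=()'{}-]"
out <- gsub(safe_regex, "_", c("a-b", "c\\d", "e[f]"), perl = TRUE)
print(out)  # "a_b" "c_d" "e[f]"
```

With this pattern the square brackets are left alone, because they are neither listed nor covered by a range.
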
Comment (Mar 20 at 21:27): You have an unescaped - in your character class. It must be "[<>:\"/\\\\|?*~,;+\\-`!@#$%^&=()'{}]"

1 Answer


Use chartr. Also note that R (4.0.0 and later) supports the r"{...}" notation for raw string constants, in which escape sequences are not interpreted. As the comment below this answer points out, if - is included in bad it should be placed at the end, because between two characters it has a special meaning (denoting a range).

bad <- r"{"<>:\"/\|?*}"
chartr(bad, strrep("_", nchar(bad)), r"{x":[\y}")
## [1] "x__[_y"

This variation also works:

chartr(bad, gsub(".", "_", bad), r"{x":[\y}")
## [1] "x__[_y"
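
The same idea can be wrapped in a small helper that lets the caller choose the replacement character. This is only a sketch: replace_forbidden() and its arguments are hypothetical names, and it assumes single-character entries in bad_chars and a single-character replacement other than "-".

```r
# Hypothetical helper: replace every character listed in `bad_chars`
# with one replacement character, using chartr().
replace_forbidden <- function(x, bad_chars, replacement = "_") {
  stopifnot(nchar(replacement) == 1, replacement != "-")
  # chartr() reads "a-z" style specifications as ranges, so move "-"
  # to the end of the specification, where it is taken literally.
  bad_chars <- unique(bad_chars)
  bad_chars <- c(setdiff(bad_chars, "-"), intersect(bad_chars, "-"))
  old <- paste0(bad_chars, collapse = "")
  new <- strrep(replacement, nchar(old))
  chartr(old, new, x)
}

cleaned <- replace_forbidden(c("a+b-c", "x:y*z"), c("+", "-", "*", ":"))
print(cleaned)  # "a_b_c" "x_y_z"
```

Because chartr() does not use regular expressions, the backslash and the square brackets need no escaping here.
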

1 Comment

This is much easier than what I was trying. Of note, an unescaped - in the middle of bad acts as a range specifier (e.g., can cause "Error in chartr(...) : decreasing range specification ('+-*')"). I put the - at the end of r"{,:;+*/~``!@#$%^&=<>?()[]{}|"'-}" and now this strategy works perfectly. Thanks!
