Special characters!
Using regular expressions with plain words is generally fine (efficiency concerns aside). There will, however, be an issue when the pattern contains special characters. This is often overlooked, and I've seen many people who did not understand why their str.replace failed.
Pandas even changed the default from regex=True to regex=False, and the original reason for that (#GH24804) was that str.replace('.', '') would remove all characters, which is expected if you know regex, but not at all if you don't.
For example, let's try to replace 1.5 with 2.3 and the $ currency symbol with £:
import pandas as pd

df = pd.DataFrame({'input': ['This item costs $1.5.', 'We need 125 units.']})
df['out1'] = df['input'].str.replace('1.5', '2.3', regex=False)
df['out1_regex'] = df['input'].str.replace('1.5', '2.3', regex=True)
df['out2'] = df['input'].str.replace('$', '£', regex=False)
df['out2_regex'] = df['input'].str.replace('$', '£', regex=True)
Output:
                   input                   out1             out1_regex  \
0  This item costs $1.5.  This item costs $2.3.  This item costs $2.3.
1     We need 125 units.     We need 125 units.     We need 2.3 units.

                    out2              out2_regex
0  This item costs £1.5.  This item costs $1.5.£
1     We need 125 units.     We need 125 units.£
Since . and $ have a special meaning in a regex, those cannot be used as is and should have been escaped (1\.5 / \$), which can be done programmatically with re.escape.
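As a minimal sketch of the escaping step, the snippet below applies re.escape to the two patterns from the example and runs them through re.sub on the same two strings. It uses only the standard library; the variable names are illustrative and not part of the original answer.

```python
import re

texts = ['This item costs $1.5.', 'We need 125 units.']

# re.escape backslash-escapes regex metacharacters so they match literally
pat_number = re.escape('1.5')  # the dot no longer means "any character"
pat_dollar = re.escape('$')    # the dollar no longer means "end of string"

out_number = [re.sub(pat_number, '2.3', t) for t in texts]
out_dollar = [re.sub(pat_dollar, '£', t) for t in texts]

print(pat_number, pat_dollar)
print(out_number)
print(out_dollar)
```

The escaped patterns can be passed to str.replace in the same way, for instance df['input'].str.replace(re.escape('1.5'), '2.3', regex=True), which should then leave 125 untouched.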
How does str.replace decide to use a regex or a plain string operation?
A pure Python replacement (the built-in string replace method) will be used only if all of the following hold:
- regex=False
- pat is a string (passing a compiled regex with regex=False will trigger a ValueError)
- case is not False
- no flags are set
- repl is not a callable
In all other cases, re.sub will be used.
The code that does this is in core/strings/object_array.py:
def _str_replace(
    self,
    pat: str | re.Pattern,
    repl: str | Callable,
    n: int = -1,
    case: bool = True,
    flags: int = 0,
    regex: bool = True,
):
    if case is False:
        # add case flag, if provided
        flags |= re.IGNORECASE
    if regex or flags or callable(repl):
        if not isinstance(pat, re.Pattern):
            if regex is False:
                pat = re.escape(pat)
            pat = re.compile(pat, flags=flags)
        n = n if n >= 0 else 0
        f = lambda x: pat.sub(repl=repl, string=x, count=n)
    else:
        f = lambda x: x.replace(pat, repl, n)
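To see the effect of this dispatch without pandas, here is a rough single-string sketch of the same branching. The function name replace_one is hypothetical and not part of pandas. It mirrors only the logic quoted above and leaves out the ValueError check for compiled patterns mentioned earlier.

```python
import re


def replace_one(s, pat, repl, n=-1, case=True, flags=0, regex=True):
    """Hypothetical single-string version of the branching quoted above."""
    if case is False:
        flags |= re.IGNORECASE
    if regex or flags or callable(repl):
        # regex path: a literal pattern is escaped first, then compiled
        if not isinstance(pat, re.Pattern):
            if regex is False:
                pat = re.escape(pat)
            pat = re.compile(pat, flags=flags)
        return pat.sub(repl, s, count=n if n >= 0 else 0)
    # plain path: built-in str.replace, where n=-1 means "replace all"
    return s.replace(pat, repl, n)


text = 'A.b axb'
plain = replace_one(text, 'a.b', 'X', regex=False)
literal_nocase = replace_one(text, 'a.b', 'X', case=False, regex=False)
regex_nocase = replace_one(text, 'a.b', 'X', case=False, regex=True)
print(plain, '|', literal_nocase, '|', regex_nocase)
```

With regex=False and case=False, the pattern still goes through re, but escaped, so the dot only matches a literal dot. With regex=True the dot matches any character, so axb is replaced as well.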
Efficiency
Considering a pattern without special characters, regex=True is about 6 times slower than regex=False in the linear regime:
