
Suppose I have a dataframe in which I want to replace a non-regex substring consisting only of characters (i.e. a-z, A-Z) and/or digits (i.e. 0-9) via pd.Series.str.replace. The docs state that this function is equivalent to str.replace or re.sub(), depending on the regex argument (default False).

Apart from most likely being overkill, are there any downsides to consider if the function was called with regex=True for non-regex replacements (e.g. performance)? If so, which ones? Of course, I am not suggesting using the function in this way.

Example: Replace 'Elephant' in the below dataframe.

import pandas as pd

data = {'Animal_Name': ['Elephant African', 'Elephant Asian', 'Elephant Indian', 'Elephant Borneo', 'Elephant Sumatran']}
df = pd.DataFrame(data)

df['Animal_Name'] = df['Animal_Name'].str.replace('Elephant', 'Tiger', regex=True)

3 Answers

3

Special characters!

Using regular expressions with plain words is generally fine (aside from efficiency concerns). There will, however, be an issue as soon as the pattern contains special characters. This is often overlooked, and I've seen many people who did not understand why their str.replace failed.

Pandas even changed the default from regex=True to regex=False. The original reason for that (#GH24804) was that str.replace('.', '') would remove all characters, which is expected if you know regex, but not at all if you don't.

For example, let's try to replace 1.5 with 2.3 and the $ currency by £:

df = pd.DataFrame({'input': ['This item costs $1.5.', 'We need 125 units.']})

df['out1'] = df['input'].str.replace('1.5', '2.3', regex=False)
df['out1_regex'] = df['input'].str.replace('1.5', '2.3', regex=True)

df['out2'] = df['input'].str.replace('$', '£', regex=False)
df['out2_regex'] = df['input'].str.replace('$', '£', regex=True)

Output:

                   input                   out1             out1_regex  \
0  This item costs $1.5.  This item costs $2.3.  This item costs $2.3.   
1     We need 125 units.     We need 125 units.     We need 2.3 units.   

                    out2              out2_regex  
0  This item costs £1.5.  This item costs $1.5.£  
1     We need 125 units.     We need 125 units.£  

Since . and $ have a special meaning in a regex, those cannot be used as is and should have been escaped (1\.5 / \$), which can be done programmatically with re.escape.
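
As a minimal sketch of the re.escape approach, reusing the dataframe from above (the variable names escaped_number, escaped_dollar, out1 and out2 are only illustrative): escaping the pattern first makes regex=True behave like a literal replacement.

```python
import re
import pandas as pd

df = pd.DataFrame({'input': ['This item costs $1.5.', 'We need 125 units.']})

# re.escape backslash-escapes regex metacharacters such as '.' and '$',
# so the escaped pattern only matches the literal text
escaped_number = re.escape('1.5')
escaped_dollar = re.escape('$')

out1 = df['input'].str.replace(escaped_number, '2.3', regex=True)
out2 = df['input'].str.replace(escaped_dollar, '£', regex=True)

print(out1.tolist())  # ['This item costs $2.3.', 'We need 125 units.']
print(out2.tolist())  # ['This item costs £1.5.', 'We need 125 units.']
```

With the escaped patterns, '125' is left alone and '£' replaces the actual dollar sign instead of being appended at the end of each string.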

How does str.replace decide to use a regex or a plain string operation?

A pure Python replacement will be used only if all of the following hold:

  • regex=False
  • pat is a string (passing a compiled regex with regex=False will trigger a ValueError)
  • case is not False
  • no flags are set
  • repl is not a callable

In all other cases, re.sub will be used.
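
The observable effect of these rules can be sketched as follows. The series s and the result names are made up for illustration, and the snippet only shows the results, not which engine pandas runs internally.

```python
import re
import pandas as pd

s = pd.Series(['A.B', 'AxB'])

# regex=False alone: plain, case-sensitive substring replacement,
# so the lowercase pattern matches nothing
literal = s.str.replace('a.b', 'Z', regex=False)

# regex=False with case=False: the pattern is still treated literally
# (the '.' is not a wildcard), but matching ignores case
literal_nocase = s.str.replace('a.b', 'Z', regex=False, case=False)

# regex=True with case=False: '.' is a wildcard, so both rows match
regex_nocase = s.str.replace('a.b', 'Z', regex=True, case=False)

# A compiled pattern combined with regex=False is rejected
try:
    s.str.replace(re.compile('a'), 'Z', regex=False)
    raised = False
except ValueError:
    raised = True

print(literal.tolist())         # ['A.B', 'AxB']
print(literal_nocase.tolist())  # ['Z', 'AxB']
print(regex_nocase.tolist())    # ['Z', 'Z']
print(raised)                   # True
```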

The code that does this is in core/strings/object_array.py:

    def _str_replace(
        self,
        pat: str | re.Pattern,
        repl: str | Callable,
        n: int = -1,
        case: bool = True,
        flags: int = 0,
        regex: bool = True,
    ):
        if case is False:
            # add case flag, if provided
            flags |= re.IGNORECASE

        if regex or flags or callable(repl):
            if not isinstance(pat, re.Pattern):
                if regex is False:
                    pat = re.escape(pat)
                pat = re.compile(pat, flags=flags)

            n = n if n >= 0 else 0
            f = lambda x: pat.sub(repl=repl, string=x, count=n)
        else:
            f = lambda x: x.replace(pat, repl, n)
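
One detail in the excerpt that is easy to miss is the line n = n if n >= 0 else 0. A plausible reading is that it bridges the two count conventions: str.replace treats a negative count as "replace everything", while re.sub uses count=0 for that. A small standard-library sketch (the variable names are illustrative):

```python
import re

text = 'aaa'

# str.replace: a negative count means "replace every occurrence"
all_str = text.replace('a', 'b', -1)

# re.sub: count=0 means "replace every occurrence"
all_re = re.sub('a', 'b', text, count=0)

# With a positive count, both limit the number of replacements
first_str = text.replace('a', 'b', 1)
first_re = re.sub('a', 'b', text, count=1)

# str.replace with count=0 replaces nothing, so -1 may only be
# mapped to 0 on the regex branch
none_str = text.replace('a', 'b', 0)

print(all_str, all_re, first_str, first_re, none_str)
# bbb bbb baa baa aaa
```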

Efficiency

Considering a pattern without special characters, regex=True is about 6 times slower than regex=False in the linear regime:

[Figure: timing comparison of regex=True vs. regex=False for pandas str.replace]


3 Comments

Certainly true, which is why I wrote the substring should only contain characters (a-z, A-Z) and digits (0-9) :)
@silence_of_the_lambdas I missed that point, but IMO that's the #1 issue with the regex parameter. As I mentioned in a comment on another answer, regex=False does not solely determine whether a regex will be used. If you use regex=False, case=False, then a regex will be used.
@silence_of_the_lambdas to answer your other comment, I added some more details on how the choice to use the regex engine is done
2

The Pandas str.replace() function uses re.sub() under the hood, when the regex=True flag is set. Otherwise, for regex=False, it uses the Python base replace() string function.

The implementations of re.sub and replace are not the same. In general, we would expect more overhead when doing your substring replacement with re.sub than with replace. One such overhead is that, when using re.sub (str.replace with regex=True), the first argument needs to be parsed as a regular expression before it can be used.

In general, you should avoid using regular expressions if you do not need them, so sticking with regex=False is the best option for performance when you do not need regex.
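
A rough way to check this outside pandas is to time the two underlying Python operations directly. This is only a sketch: the list names, the helper functions and the repeat counts are arbitrary choices, the pattern is precompiled so parsing cost is excluded, and the timings will differ between machines.

```python
import re
from timeit import timeit

names = ['Elephant African', 'Elephant Asian', 'Elephant Indian'] * 1000

# Compiled once up front, so only the matching cost is measured
pattern = re.compile('Elephant')

def with_str_replace():
    # Plain substring replacement
    return [name.replace('Elephant', 'Tiger') for name in names]

def with_re_sub():
    # Same replacement through the regex engine
    return [pattern.sub('Tiger', name) for name in names]

# Both approaches must give identical results for a plain pattern
assert with_str_replace() == with_re_sub()

t_plain = timeit(with_str_replace, number=100)
t_regex = timeit(with_re_sub, number=100)
print(f'str.replace: {t_plain:.3f}s, re.sub: {t_regex:.3f}s')
```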

4 Comments

That's not completely true. Yes, str.replace uses both Python's str.replace and re.sub, but the choice to use one or the other is not made solely based on the regex keyword. For example, pd.Series(['ABC']).str.replace('a', 'x', regex=False, case=False) will use re.sub under the hood.
@mozway: Out of curiosity, what other factors influence the choice of function?
@mozway The gist of my answer was invoking a regex engine versus doing a simple replacement. If, under the hood, Pandas is getting smart and ignoring its own regex flag, then it is out of our control anyway.
@silence_of_the_lambdas passing flags and using a callable as replacement / @Tim yes you can see this as a "smart" function. Series.replace is even more "advanced".
2

re.sub is substantially slower than str.replace, even for simple strings like this. You can verify this with the timeit module:

from timeit import timeit

setup="import pandas as pd; series = pd.Series(['Elephant']).repeat(10000)"

print(timeit(setup=setup, stmt="series.str.replace('Elephant', 'Tiger', regex=False)",
             number=10000))
print(timeit(setup=setup, stmt="series.str.replace('Elephant', 'Tiger', regex=True)",
             number=10000))

For me, regex=True took roughly 1.7 times as long as regex=False:

11.037849086000051
18.286654022999983
