Pandas string replace with regex argument for non-regex replacements

Question

Suppose I have a dataframe in which I want to replace a non-regex substring consisting only of characters (i.e. a-z, A-Z) and/or digits (i.e. 0-9) via pd.Series.str.replace. The docs state that this function is equivalent to str.replace or re.sub(), depending on the regex argument (default False).

Apart from most likely being overkill, are there any downsides to consider if the function was called with regex=True for non-regex replacements (e.g. performance)? If so, which ones? Of course, I am not suggesting using the function in this way.

Example: Replace 'Elephant' in the below dataframe.

import pandas as pd

data = {'Animal_Name': ['Elephant African', 'Elephant Asian', 'Elephant Indian', 'Elephant Borneo', 'Elephant Sumatran']}
df = pd.DataFrame(data)

df['Animal_Name'] = df['Animal_Name'].str.replace('Elephant', 'Tiger', regex=True)

mozway · Accepted Answer · 2024-08-20 09:56:08Z

3

Special characters!

Using regular expressions with plain words is generally fine (aside from efficiency concerns). There will, however, be an issue when the pattern contains special characters. This is an often overlooked problem, and I've seen many people who did not understand why their str.replace failed.

Pandas even changed the default from regex=True to regex=False, and the original reason for that (#GH24804) was that str.replace('.', '') would remove all characters, which is expected if you know regex, but not at all if you don't.
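A minimal sketch of that surprise, using the standard library's re.sub and str.replace as stand-ins for what pandas applies to each element (the sample string is made up for illustration):

```python
import re

sample = "v1.2.3"  # hypothetical value, for illustration only

# Interpreted as a regex, '.' matches any character (except a newline),
# so every character is removed.
as_regex = re.sub(".", "", sample)

# Interpreted literally, only the dots are removed.
as_literal = sample.replace(".", "")

print(repr(as_regex))    # ''
print(repr(as_literal))  # 'v123'
```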

For example, let's try to replace 1.5 with 2.3 and the $ currency by £:

df = pd.DataFrame({'input': ['This item costs $1.5.', 'We need 125 units.']})

df['out1'] = df['input'].str.replace('1.5', '2.3', regex=False)
df['out1_regex'] = df['input'].str.replace('1.5', '2.3', regex=True)

df['out2'] = df['input'].str.replace('$', '£', regex=False)
df['out2_regex'] = df['input'].str.replace('$', '£', regex=True)

Output:

                   input                   out1             out1_regex  \
0  This item costs $1.5.  This item costs $2.3.  This item costs $2.3.   
1     We need 125 units.     We need 125 units.     We need 2.3 units.   

                    out2              out2_regex  
0  This item costs £1.5.  This item costs $1.5.£  
1     We need 125 units.     We need 125 units.£

Since . and $ have a special meaning in a regex, those cannot be used as is and should have been escaped (1\.5 / \$), which can be done programmatically with re.escape.
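As a rough sketch of the re.escape approach, here is the same replacement done with the standard library only, on the two strings from the example above. The variable names are made up for illustration.

```python
import re

rows = ["This item costs $1.5.", "We need 125 units."]

# re.escape backslash-escapes regex metacharacters so the pattern
# matches the literal text only.
pat_number = re.compile(re.escape("1.5"))  # pattern text: 1\.5
pat_dollar = re.compile(re.escape("$"))    # pattern text: \$

fixed_number = [pat_number.sub("2.3", s) for s in rows]
fixed_dollar = [pat_dollar.sub("£", s) for s in rows]

print(fixed_number)  # ['This item costs $2.3.', 'We need 125 units.']
print(fixed_dollar)  # ['This item costs £1.5.', 'We need 125 units.']
```

In pandas, the escaped pattern can be passed in the same way, for instance df['input'].str.replace(re.escape('1.5'), '2.3', regex=True).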

How does `str.replace` decide to use a regex or a plain string operation?

A pure Python replacement will be used only if all of the following hold:

regex=False
pat is a string (passing a compiled regex with regex=False will trigger a ValueError)
case is not False
no flags are set
repl is not a callable

In all other cases, re.sub will be used.

The code that does this is in core/strings/object_array.py:

    def _str_replace(
        self,
        pat: str | re.Pattern,
        repl: str | Callable,
        n: int = -1,
        case: bool = True,
        flags: int = 0,
        regex: bool = True,
    ):
        if case is False:
            # add case flag, if provided
            flags |= re.IGNORECASE

        if regex or flags or callable(repl):
            if not isinstance(pat, re.Pattern):
                if regex is False:
                    pat = re.escape(pat)
                pat = re.compile(pat, flags=flags)

            n = n if n >= 0 else 0
            f = lambda x: pat.sub(repl=repl, string=x, count=n)
        else:
            f = lambda x: x.replace(pat, repl, n)
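To make the branch condition easier to experiment with, the sketch below copies only the decision from the snippet above into a standalone function. uses_regex_engine is a hypothetical helper, not part of pandas. Its regex default is set to False to mirror the public default mentioned in the question, which is an assumption of this sketch.

```python
import re
from typing import Callable, Union


def uses_regex_engine(
    repl: Union[str, Callable],
    case: bool = True,
    flags: int = 0,
    regex: bool = False,
) -> bool:
    """Hypothetical helper: True if the quoted branch would pick pat.sub."""
    if case is False:
        # Same step as in the quoted code: case=False becomes a regex flag.
        flags |= re.IGNORECASE
    return bool(regex or flags or callable(repl))


print(uses_regex_engine("x"))                      # False
print(uses_regex_engine("x", regex=True))          # True
print(uses_regex_engine("x", case=False))          # True
print(uses_regex_engine("x", flags=re.MULTILINE))  # True
print(uses_regex_engine(lambda m: "x"))            # True
```

The third call is the case that tends to surprise: regex=False combined with case=False still goes through the regex engine, with the pattern escaped first.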

Efficiency

For a pattern without special characters, regex=True is about 6 times slower than regex=False in the linear regime.

edited Aug 20, 2024 at 9:56

answered Aug 20, 2024 at 9:03

mozway


3 Comments

silence_of_the_lambdas Over a year ago

Certainly true, which is why I wrote the substring should only contain characters (a-z, A-Z) and digits (0-9) :)

mozway Over a year ago

@silence_of_the_lambdas I missed that point, but IMO that's the #1 issue with the regex parameter. As I mentioned in a comment on another answer, regex=False does not solely determine whether a regex will be used. If you use regex=False, case=False, then a regex will be used.

mozway Over a year ago

@silence_of_the_lambdas to answer your other comment, I added some more details on how the choice to use the regex engine is done

Tim Biegeleisen · Accepted Answer · 2024-08-20 09:02:19Z

2

The Pandas str.replace() function uses re.sub() under the hood, when the regex=True flag is set. Otherwise, for regex=False, it uses the Python base replace() string function.

The implementations of re.sub and replace are not the same. In general, we would expect a higher overhead when doing your substring replacement with re.sub than with replace. One such overhead is that, when using re.sub (str.replace with regex=True), the first argument must be parsed as a regular expression before it can be used.

In general, you should avoid using regular expressions if you do not need them, so sticking with regex=False is the best option for performance when you do not need regex.
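One way to get a feel for that overhead outside pandas is a small timeit comparison with the standard library. This is only a sketch: the list of values and the function names are made up, and the timings will vary by machine and Python version.

```python
import re
from timeit import timeit

values = ["Elephant African", "Elephant Asian", "Elephant Indian"] * 1000

pattern = re.compile("Elephant")  # compiled once, up front


def with_str_replace():
    # Plain substring replacement, no regex engine involved.
    return [s.replace("Elephant", "Tiger") for s in values]


def with_re_sub():
    # Same replacement routed through the regex engine.
    return [pattern.sub("Tiger", s) for s in values]


# Both approaches must agree for a pattern without metacharacters.
assert with_str_replace() == with_re_sub()

print("str.replace:", timeit(with_str_replace, number=100))
print("re.sub     :", timeit(with_re_sub, number=100))
```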

answered Aug 20, 2024 at 9:02

Tim Biegeleisen


4 Comments

mozway Over a year ago

That's not completely true. Yes, str.replace uses both Python's str.replace and re.sub, but the choice of one or the other is not made solely based on the regex keyword. For example, pd.Series(['ABC']).str.replace('a', 'x', regex=False, case=False) will use re.sub under the hood.

silence_of_the_lambdas Over a year ago

@mozway: Out of curiosity, what other factors influence the choice of function?

Tim Biegeleisen Over a year ago

@mozway The gist of my answer was invoking a regex engine versus doing a simple replacement. If, under the hood, Pandas is getting smart and ignoring its own regex flag, then it is out of our control anyway.

mozway Over a year ago

@silence_of_the_lambdas passing flags and using a callable as replacement / @Tim yes you can see this as a "smart" function. Series.replace is even more "advanced".

Anerdw · Accepted Answer · 2024-08-20 09:17:48Z

2

re.sub is substantially slower than str.replace, even for simple strings like this. You can verify this with the timeit module:

from timeit import timeit

setup="import pandas as pd; series = pd.Series(['Elephant']).repeat(10000)"

print(timeit(setup=setup, stmt="series.str.replace('Elephant', 'Tiger', regex=False)",
             number=10000))
print(timeit(setup=setup, stmt="series.str.replace('Elephant', 'Tiger', regex=True)",
             number=10000))

For me, the results were noticeably different.

11.037849086000051
18.286654022999983

answered Aug 20, 2024 at 9:17

Anerdw

