Pandas filter dataframe columns through substring match

Question

I have a dataframe with multiple columns, eg:

     Name  Age   Fname
0    Alex   10   Alice
1     Bob   12     Bob
2  Clarke   13  clarke

My filter condition is to check whether Name is a (case-insensitive) substring of the corresponding Fname.

If it was equality, something as simple as:

df[df["Name"].str.lower() == df["Fname"].str.lower()]

works. However, I want a substring match, so instead of ==, I thought in would work. But that gives an error, as it interprets one of the arguments as a pd.Series. My first question is: why this difference in interpretation?

Another way I tried was using .str.contains:

df[df["Fname"].str.contains(df["Name"], case=False)]

which also interprets df["Name"] as a pd.Series and fails, although it does, of course, work with a constant string as the argument.

E.g. this works:
df[df["Fname"].str.contains("a", case=False)]

I want to resolve this situation, so any help in that regard is appreciated.

Scott Boston · Accepted Answer · 2021-12-04 22:04:55Z

2

The .str accessor loops over the values internally and can be slow. Most of the time it is better to use a list comprehension.

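import pandas as pd
import timeit
import matplotlib.pyplot as plt

# Sample frame from the question (it was not defined in the original snippet).
df = pd.DataFrame({'Name': ['Alex', 'Bob', 'Clarke'],
                   'Age': [10, 12, 13],
                   'Fname': ['Alice', 'Bob', 'clarke']})

# Equality check: lower-case both columns, zip them, and keep rows whose pair collapses to one value.
def list_comprehension_lower(df):
    return df[[len(set(i)) == 1 for i in (zip([x.lower() for x in df['Name']],[y.lower() for y in df['Fname']]))]]

# Row-wise apply; note that this one tests substring membership (in), not equality.
def apply_axis_1_lower(df):
    return df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

# Equality check through the .str accessor.
def dot_string_lower(df):
    return df[df["Name"].str.lower() == df["Fname"].str.lower()]

fig, ax = plt.subplots()
res = pd.DataFrame(
    index=[1, 5, 10, 30, 50, 100, 300, 500, 700, 1000, 10000],
    columns='list_comprehension_lower apply_axis_1_lower dot_string_lower'.split(),
    dtype=float
)

# Time each function on the sample frame repeated i times (run this as a script so __main__ holds d).
for i in res.index:
    d = pd.concat([df]*i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit.timeit(stmt, setp, number=100)

# One line per function on log-log axes.
res.plot(loglog=True, ax=ax);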

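Output: [log-log plot of the timings, one line per function; x-axis: number of copies of df, y-axis: time for 100 runs]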

Now, back to your original question: you can use a list comprehension with zip and in:

df.loc[2, 'Fname'] += ' Adams'

df[[x in y for x, y in zip([x.lower() for x in df['Name']],[y.lower() for y in df['Fname']])]]

Output:

     Name  Age         Fname
1     Bob   12           Bob
2  Clarke   13  clarke Adams
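
If either column can hold missing values, calling .lower() on NaN fails. Below is a minimal sketch of the same zip-based filter wrapped in a helper. The name name_in_fname, its parameters, and the choice to treat non-string values as non-matching are assumptions made for illustration; they are not part of the original answer.

```python
import pandas as pd


def name_in_fname(df, needle_col="Name", haystack_col="Fname"):
    """Keep rows where needle_col is a case-insensitive substring of haystack_col.

    Hypothetical helper: rows with a non-string value (e.g. None/NaN) in
    either column are treated as non-matching instead of raising.
    """
    mask = [
        isinstance(n, str) and isinstance(h, str) and n.lower() in h.lower()
        for n, h in zip(df[needle_col], df[haystack_col])
    ]
    return df[mask]


df = pd.DataFrame(
    {
        "Name": ["Alex", "Bob", "Clarke", None],
        "Age": [10, 12, 13, 14],
        "Fname": ["Alice", "Bob", "clarke Adams", "Dan"],
    }
)
result = name_in_fname(df)
print(result)
```

The isinstance checks run before .lower(), so a missing value short-circuits to False rather than raising AttributeError.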


3 Comments

vish4071 Over a year ago

Doesn't this graph show that it is better to use .str.lower()? Or am I missing something?

Scott Boston Over a year ago

Yes, you are correct. Beyond a certain number of rows, in this case, .str.lower() outperforms the list comprehension.

sammywemmy Over a year ago

That list comprehension is a double for loop, so it makes sense in the chart that at some point the pandas str functions outperform it.

Corralien · Accepted Answer · 2021-12-04 21:37:22Z

1

You can apply a function to each row (axis=1):

>>> df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke

str.contains takes a constant as its first argument, pat, not a Series.
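
As for why == works but in does not: pandas overloads == to compare element by element, whereas Python's in calls Series.__contains__ on the right-hand operand, which tests membership among the index labels (not the values) and yields a single bool. A small sketch of the difference; the variable names are illustrative only:

```python
import pandas as pd

names = pd.Series(["Alex", "Bob", "Clarke"])
fnames = pd.Series(["Alice", "Bob", "clarke"])

# == is element-wise and returns a boolean Series usable as a mask.
eq_mask = names.str.lower() == fnames.str.lower()
print(eq_mask.tolist())

# `in` on a Series looks at the index labels, not the values.
label_hit = 0 in fnames        # 0 is an index label
value_hit = "Bob" in fnames    # "Bob" is a value, not a label
print(label_hit, value_hit)

# A whole Series on the left of `in` cannot be looked up as one label,
# so the expression raises instead of comparing row by row.
try:
    names in fnames
    raised = None
except Exception as exc:
    raised = type(exc).__name__
print(raised)
```

Because in always collapses to one truth value, there is no way for it to produce a per-row mask, which is why a row-wise loop is needed here.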


10 Comments

user17242583 Over a year ago

I just updated my answer before I saw yours. Our answers are now identical! :D

Corralien Over a year ago

Unfortunately, you don't really have a choice here.

user17242583 Over a year ago

Wow, then let me see if I can work something out.

Corralien Over a year ago

For 900K rows, it took 8.13s. It's not really so bad.

vish4071 Over a year ago

Since I am doing a lot of processing before this final filtering, 8-10s does not matter much for me. Hence accepting this answer.


user17242583 · Accepted Answer · 2021-12-04 21:38:09Z

1

You could use .apply() with axis=1 to call a function for each row:

subset = df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

Output:

>>> subset
     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke

edited Dec 4, 2021 at 21:38

answered Dec 4, 2021 at 21:32

user17242583

3 Comments

vish4071 Over a year ago

My bad...there was a mistake in that particular line of code that I mentioned (edited). But please read through the whole question. This is not what I want. My requirement is "substring match"

user17242583 Over a year ago

@vish4071 check again; I updated the answer :)

vish4071 Over a year ago

Isn't this too expensive, especially for a big dataframe?

sammywemmy · Accepted Answer · 2021-12-04 22:55:53Z

0

Does this other option, via a list comprehension, work for you?

df.loc[[left.lower() in right.lower() 
        for left, right 
        in zip(df.Name, df.Fname)]
       ]

     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke
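
For larger frames, one possible vectorized alternative (not taken from the answers here) is NumPy's element-wise np.char.find, which accepts an array for both the strings and the substrings. This is a sketch that assumes both columns contain only strings (no NaN); whether it actually beats the list comprehension depends on the data and should be measured.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Alex", "Bob", "Clarke"],
        "Age": [10, 12, 13],
        "Fname": ["Alice", "Bob", "clarke"],
    }
)

# Lower-case both columns and convert them to NumPy string arrays.
haystack = df["Fname"].str.lower().to_numpy(dtype=str)
needle = df["Name"].str.lower().to_numpy(dtype=str)

# np.char.find gives, per element, the first position of needle in
# haystack, or -1 when it does not occur.
mask = np.char.find(haystack, needle) >= 0
subset = df[mask]
print(subset)
```

The comparison with -1 turns the positions into a boolean mask that can index the frame directly.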
