I have a DataFrame with multiple columns, e.g.:

     Name  Age   Fname
0    Alex   10   Alice
1     Bob   12     Bob
2  Clarke   13  clarke

My filter condition is to check whether Name is a (case-insensitive) substring of the corresponding Fname.

If it were an equality check, something as simple as:

df[df["Name"].str.lower() == df["Fname"].str.lower()]

works. However, I want a substring match, so I thought replacing == with in would work. But that raises an error, because in treats one of its operands as a whole pd.Series instead of comparing row by row. My first question is: why does in behave differently from == here?

Another way I tried was using .str.contains:

df[df["Fname"].str.contains(df["Name"], case=False)]

which also fails because df["Name"] is passed as a whole pd.Series, although it does work with a constant string as the argument.

E.g., this works:
df[df["Fname"].str.contains("a", case=False)]

I want to resolve this situation, so any help in that regard is appreciated.

4 Answers

2

The .str accessor loops over the values in Python and can be slow. In most cases a list comprehension is the better choice.

import pandas as pd
import numpy as np
import timeit
import matplotlib.pyplot as plt
import pandas.testing as pt

def list_comprehension_lower(df):
    return df[[len(set(i)) == 1 for i in (zip([x.lower() for x in df['Name']],[y.lower() for y in df['Fname']]))]]

def apply_axis_1_lower(df):
    return df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

def dot_string_lower(df):
    return df[df["Name"].str.lower() == df["Fname"].str.lower()]

fig, ax = plt.subplots()
res = pd.DataFrame(
    index=[1, 5, 10, 30, 50, 100, 300, 500, 700, 1000, 10000],
    columns='list_comprehension_lower apply_axis_1_lower dot_string_lower'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df]*i, ignore_index=True)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        res.at[i, j] = timeit.timeit(stmt, setp, number=100)

# Plot all three timing columns on one log-log axis
# (groupby(..., axis=1) is deprecated in recent pandas versions).
res.plot(loglog=True, ax=ax);

Output:

[Figure: log-log plot of the timeit results for the three functions against the number of times df is repeated]

Now, back to your original question: you can use a list comprehension with zip and in:

df.loc[2, 'Fname'] += ' Adams'

df[[x in y for x, y in zip([x.lower() for x in df['Name']],[y.lower() for y in df['Fname']])]]

Output:

     Name  Age         Fname
1     Bob   12           Bob
2  Clarke   13  clarke Adams
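
As a quick sanity check, the list-comprehension filter can be compared against the row-wise apply filter to confirm that both select the same rows. This is only a sketch: it rebuilds the sample frame from the question, and the function names list_comprehension_in and apply_axis_1_in are made up here for illustration.

```python
import pandas as pd
import pandas.testing as pt

# Sample frame from the question, with the same modification as above.
df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Clarke"],
    "Age": [10, 12, 13],
    "Fname": ["Alice", "Bob", "clarke"],
})
df.loc[2, "Fname"] += " Adams"


def list_comprehension_in(df):
    # Hypothetical helper: substring test with zip over lower-cased values.
    names = [x.lower() for x in df["Name"]]
    fnames = [y.lower() for y in df["Fname"]]
    return df[[x in y for x, y in zip(names, fnames)]]


def apply_axis_1_in(df):
    # Hypothetical helper: the same test, evaluated once per row via apply.
    return df[df.apply(lambda row: row["Name"].lower() in row["Fname"].lower(), axis=1)]


expected = list_comprehension_in(df)
# Raises AssertionError if the two approaches disagree.
pt.assert_frame_equal(expected, apply_axis_1_in(df))
```

Running the check before benchmarking helps ensure that the timed functions actually compute the same thing.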

3 Comments

Doesn't this graph show that it is better to use .str.lower()? Or am I missing something?
Yes, you are correct: in this case, beyond a certain number of rows, .str.lower() outperforms the list comprehension.
That list comprehension is a double for loop, so it makes sense that at some point the pandas .str function will outperform it.
1

You can apply a function to each row with axis=1:

>>> df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke

str.contains expects a constant string (or regex) as its first argument, pat, not a Series.
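
On the question of why == works but in does not, a small sketch may help. pandas overloads == to compare two Series element by element, whereas Python's in operator asks the right-hand Series whether it contains the left operand, and for a Series that is a lookup among the index labels rather than a value-by-value comparison. That is presumably why passing a whole Series on the left fails. The frame below is the sample from the question; the variable names are chosen here for illustration.

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Clarke"],
    "Age": [10, 12, 13],
    "Fname": ["Alice", "Bob", "clarke"],
})

names = df["Name"].str.lower()
fnames = df["Fname"].str.lower()

# `==` is overloaded by pandas and compares the two Series element by element.
equal_mask = names == fnames

# `in` is not element-wise: on a Series it tests membership among the
# index labels, not the values.
label_found = 1 in names       # 1 is an index label
value_found = "bob" in names   # "bob" is a value, not an index label

# A substring test therefore has to be evaluated per row, e.g. with zip.
substring_mask = [n in f for n, f in zip(names, fnames)]
result = df[substring_mask]
```

Here label_found is true and value_found is false, which shows that in looks at the index, while the per-row mask keeps the Bob and Clarke rows.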

10 Comments

I just updated my answer before I saw yours. Our answers are now identical! :D
Unfortunately, you don't really have a choice here.
Wow, then let me see if I can work something out.
For 900K rows, it took 8.13s. It's not really so bad.
Since I am doing a lot of processing before this final filtering, 8-10s does not matter much for me. Hence accepting this answer.
1

You could use .apply() with axis=1 to call a function for each row:

subset = df[df.apply(lambda x: x['Name'].lower() in x['Fname'].lower(), axis=1)]

Output:

>>> subset
     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke

3 Comments

My bad... there was a mistake in that particular line of code that I mentioned (now edited). But please read through the whole question. This is not what I want; my requirement is a substring match.
@vish4071 check again; I updated the answer :)
Isn't this too expensive, especially for a big DataFrame?
0

Does this alternative via a list comprehension work for you?

df.loc[[left.lower() in right.lower()
        for left, right in zip(df.Name, df.Fname)]]

     Name  Age   Fname
1     Bob   12     Bob
2  Clarke   13  clarke
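
If the columns may contain missing values, calling .lower() on them would fail, so the comprehension can be wrapped in a small helper that skips non-string entries. The following is a minimal sketch under that assumption: the function name filter_substring and the extra fourth row with a missing Name are hypothetical and added only for illustration.

```python
import pandas as pd


def filter_substring(df, needle_col, haystack_col):
    """Keep rows where df[needle_col] is a case-insensitive substring of df[haystack_col].

    Hypothetical helper: rows where either value is not a string are dropped.
    """
    mask = [
        isinstance(needle, str)
        and isinstance(hay, str)
        and needle.lower() in hay.lower()
        for needle, hay in zip(df[needle_col], df[haystack_col])
    ]
    return df.loc[mask]


# Sample frame from the question plus one row with a missing Name.
df = pd.DataFrame({
    "Name": ["Alex", "Bob", "Clarke", None],
    "Age": [10, 12, 13, 14],
    "Fname": ["Alice", "Bob", "clarke", "Dana"],
})

subset = filter_substring(df, "Name", "Fname")
```

The isinstance checks run before .lower() is called, so a missing value on either side simply yields False for that row.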
