Filtering a pandas dataframe based on the presence of substrings in a column

Question

Not sure if this is a 'filtering with pandas' question or one of text analysis, however:

Given a df,

import pandas as pd

d = {
    "item": ["a", "b", "c", "d"],
    "report": [
        "john rode the subway through new york",
        "sally says she no longer wanted any fish, but",
        "was not submitted",
        "the doctor proceeded to call washington and new york",
    ],
}
df = pd.DataFrame(data=d)
df

Resulting in

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
c, "was not submitted"
d, "the doctor proceeded to call washington and new york"

And a list of terms to match:

terms = ["new york", "fish"]

How would you reduce the df to the following rows, keeping a row whenever any substring in terms is found in column report, and preserving item?

item, report
a, "john rode the subway through new york"
b, "sally says she no longer wanted any fish, but"
d, "the doctor proceeded to call washington and new york"

rhug123 · Accepted Answer · 2023-05-04 14:34:58Z

2

Try this:

Using a word boundary with your regex will ensure that "fish" will get matched, but "fishy" will not (as an example)

m = df['report'].str.contains(r'\b{}\b'.format(r'\b|\b'.join(terms)))

df2 = df.loc[m]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
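A minimal sketch of the word-boundary behaviour described above, in case it helps to see it run. The extra row "e" is a hypothetical addition (it is not in the question's data), and wrapping the alternation in a single non-capturing group with re.escape is a variation on the pattern above, not the exact expression from this answer:

```python
import re
import pandas as pd

df = pd.DataFrame({
    "item": ["a", "b", "c", "d", "e"],
    "report": [
        "john rode the subway through new york",
        "sally says she no longer wanted any fish, but",
        "was not submitted",
        "the doctor proceeded to call washington and new york",
        "something smelled fishy",  # hypothetical row, added only for this demo
    ],
})
terms = ["new york", "fish"]

# Escape each term so it is matched literally, then put one \b on each
# side of a non-capturing group that holds all the alternatives.
pattern = r"\b(?:{})\b".format("|".join(re.escape(t) for t in terms))

mask = df["report"].str.contains(pattern, regex=True)
df2 = df.loc[mask]
print(df2["item"].tolist())
```

Row "e" is dropped because "fish" is followed directly by "y", so there is no word boundary after it, while "fish," in row "b" still matches.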

edited May 4, 2023 at 14:34

answered Feb 8, 2023 at 18:17


PaulS · Accepted Answer · 2023-02-08 22:21:36Z

1

Another possible solution, which is based on numpy:

import numpy as np

strings = np.array(df['report'], dtype=str)
substrings = np.array(terms)

index = np.char.find(strings[:, None], substrings)
mask = (index >= 0).any(axis=1)

df.loc[mask]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
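To make the broadcasting step in this approach easier to follow, here is a small self-contained sketch that rebuilds the question's data and inspects the intermediate array. One caveat I am fairly sure of but that is not part of the answer above: converting with dtype=str turns missing values into the literal string "nan", so rows with NaN would need separate handling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "item": ["a", "b", "c", "d"],
    "report": [
        "john rode the subway through new york",
        "sally says she no longer wanted any fish, but",
        "was not submitted",
        "the doctor proceeded to call washington and new york",
    ],
})
terms = ["new york", "fish"]

strings = np.array(df["report"], dtype=str)   # shape (4,)
substrings = np.array(terms)                  # shape (2,)

# strings[:, None] has shape (4, 1); broadcasting against (2,) gives one
# row per report and one column per term. Each cell holds the position of
# the term in the report, or -1 when the term is absent.
index = np.char.find(strings[:, None], substrings)
print(index.shape)

# A report is kept if at least one term was found in it.
mask = (index >= 0).any(axis=1)
print(df.loc[mask, "item"].tolist())
```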

answered Feb 8, 2023 at 22:21


scotscotmcc · Accepted Answer · 2023-02-08 18:15:07Z

1

Pulling from another answer here:

You can turn your terms into a single regex-usable string (that is, one delimited by |) and then use Series.str.contains.

term_str = '|'.join(terms) # makes a string of 'new york|fish'
df[df['report'].str.contains(term_str)]
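One thing to watch with a plain '|'.join is that each term is interpreted as a regex, so characters such as $ or . change the meaning of the pattern. A hedged sketch of a more defensive variant follows; the demo data and the "$5" term are hypothetical and chosen only to show the effect, and na=False is my addition for rows with missing reports:

```python
import re
import pandas as pd

# Hypothetical data: one report contains a regex metacharacter, one is missing.
df_demo = pd.DataFrame({
    "item": ["a", "b", "c"],
    "report": ["paid $5 for fish", None, "paid 5 dollars"],
})
terms_demo = ["$5", "new york"]

# re.escape makes each term match literally; unescaped, "$5" would mean
# "end of string followed by 5" and could never match.
term_str = "|".join(re.escape(t) for t in terms_demo)

# na=False treats missing reports as "no match" instead of propagating NaN
# into the boolean mask.
mask = df_demo["report"].str.contains(term_str, regex=True, na=False)
filtered = df_demo[mask]
print(filtered["item"].tolist())
```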

answered Feb 8, 2023 at 18:15


Sunderam Dubey · Accepted Answer · 2023-02-08 19:09:08Z

1

Try this:

df[df['report'].apply(lambda x: any(term in x for term in terms))]

Output:

  item                                             report
0    a              john rode the subway through new york
1    b      sally says she no longer wanted any fish, but
3    d  the doctor proceeded to call washington and ne...
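Note that this does plain substring matching (no regex), and that `term in x` would raise a TypeError if a report is missing rather than a string. A minimal sketch of a guarded version, where contains_any is a hypothetical helper name and the data with a missing report is made up for illustration:

```python
import pandas as pd

def contains_any(text, terms):
    """Hypothetical helper: True if text is a string containing any term."""
    return isinstance(text, str) and any(term in text for term in terms)

# Made-up data with one missing report to show the guard in action.
df_demo = pd.DataFrame({
    "item": ["a", "b", "c"],
    "report": ["john rode the subway through new york", None, "was not submitted"],
})
terms = ["new york", "fish"]

mask = df_demo["report"].apply(lambda x: contains_any(x, terms))
result = df_demo[mask]
print(result["item"].tolist())
```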

answered Feb 8, 2023 at 19:09
