
I have a pandas DataFrame containing multiple columns of strings. I would now like to check a certain column against a list of allowed substrings and then get a new subset with the result.

import pandas as pd

substr = ['A', 'C', 'D']
df = pd.read_excel('output.xlsx')
df = df.dropna()
# now keep only the rows where the string in the 2nd column contains one of the substrings

The only approach I found was creating a list from the corresponding column and then doing a list comprehension, but then I lose the other columns. Can I use a list comprehension as part of e.g. df.str.contains()?

year  type     value   price
2000  ty-A     500     10000
2002  ty-Q     200     84600
2003  ty-R     500     56000
2003  ty-B     500     18000
2006  ty-C     500     12500
2012  ty-A     500     65000
2018  ty-F     500     86000
2019  ty-D     500     51900

expected output:

year  type     value   price
2000  ty-A     500     10000
2006  ty-C     500     12500
2012  ty-A     500     65000
2019  ty-D     500     51900
  • Please share a sample dataframe and expected output Commented Sep 4, 2019 at 9:42
  • Try df[df.iloc[:, 1].str.contains('|'.join(substr))] Commented Sep 4, 2019 at 9:44
  • @yatu, is there an easy way to format tables into a question? Commented Sep 4, 2019 at 9:45
  • Just paste the data directly. Make sure to TAB it before pasting Commented Sep 4, 2019 at 9:46
  • Possible duplicate of How to implement 'in' and 'not in' for Pandas dataframe Commented Sep 4, 2019 at 9:59

2 Answers


You could use pandas.Series.isin, which matches whole values exactly:

>>> df.loc[df['type'].isin(substr)]
   year type  value  price
0  2000    A    500  10000
4  2006    C    500  12500
5  2012    A    500  65000
7  2019    D    500  51900

1 Comment

Sorry, my fault: I really need it to be a substring match, and I edited my table accordingly. But in combination with Chri's approach, that worked out! df.loc[df['type'].str.contains('|'.join(substr))]
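For reference, here is a minimal sketch of the str.contains approach from this comment, run against the sample table from the question. The inline DataFrame stands in for output.xlsx, and the re.escape call is an added assumption (not part of the original comment) to guard against substrings that contain regex metacharacters.

```python
import re
import pandas as pd

substr = ['A', 'C', 'D']

# Sample data copied from the question; the real data would come from output.xlsx.
df = pd.DataFrame({
    'year': [2000, 2002, 2003, 2003, 2006, 2012, 2018, 2019],
    'type': ['ty-A', 'ty-Q', 'ty-R', 'ty-B', 'ty-C', 'ty-A', 'ty-F', 'ty-D'],
    'value': [500, 200, 500, 500, 500, 500, 500, 500],
    'price': [10000, 84600, 56000, 18000, 12500, 65000, 86000, 51900],
})

# Join the substrings into one regex alternation ("A|C|D").
# re.escape is an added precaution for substrings such as "." or "+".
pattern = '|'.join(re.escape(s) for s in substr)

# Boolean mask: True where 'type' contains at least one of the substrings.
mask = df['type'].str.contains(pattern)

# Indexing with the mask keeps all columns of the matching rows.
filtered = df.loc[mask]
print(filtered)
```

Because the mask is applied to the whole DataFrame, the other columns are kept, which was the problem with the list-comprehension attempt in the question.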

You could use Python's built-in any or all together with pandas.Series.apply.

If you want the rows where all of the substrings match:

df.loc[df['type'].apply(lambda x: all(word in x for word in substr))]

Or, if you want the rows where any of the substrings matches:

df.loc[df['type'].apply(lambda x: any(word in x for word in substr))]

Printing or returning that expression gives you the filtered DataFrame.
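To make the difference between any and all concrete, here is a small sketch on a hypothetical Series. The values, including 'ty-ACD', are made up for illustration and do not come from the question's data.

```python
import pandas as pd

substr = ['A', 'C', 'D']

# Hypothetical values; 'ty-ACD' is the only one containing every substring.
types = pd.Series(['ty-A', 'ty-Q', 'ty-ACD', 'ty-D'])

# True where at least one substring occurs in the value.
any_mask = types.apply(lambda x: any(word in x for word in substr))

# True only where every substring occurs in the value.
all_mask = types.apply(lambda x: all(word in x for word in substr))

print(types[any_mask].tolist())  # ['ty-A', 'ty-ACD', 'ty-D']
print(types[all_mask].tolist())  # ['ty-ACD']
```

For the table in the question, the any variant is the one that gives the expected output, since each type contains at most one of the substrings.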
