
I have a pandas dataframe whose entries are all strings:

   A     B      C
1 apple  banana pear
2 pear   pear   apple
3 banana pear   pear
4 apple  apple  pear

etc. I want to select all the rows that contain a certain string, say, 'banana'. I don't know which column it will appear in each time. Of course, I can write a for loop and iterate over all rows. But is there an easier or faster way to do this?

Comments:
  • You can also just do df[df.values == 'banana'] Commented Jul 4, 2016 at 14:14
  • @JoeT.Boka, that gives me a row for every match, so if a row has two 'banana' values, I get two rows with the same index. Not something that can't be handled, but it does require further handling. Commented Mar 4, 2021 at 22:55

4 Answers


Introduction

At the heart of selecting rows, we need a 1D boolean mask (a NumPy array or pandas Series) with the same length as df; let's call it mask. Then df[mask] gives us the selected rows of df via boolean indexing.

Here's our starting df :

In [42]: df
Out[42]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear

I. Match one string

Now, if we need to match just one string, it's straightforward with elementwise equality :

In [42]: df == 'banana'
Out[42]: 
       A      B      C
1  False   True  False
2  False  False  False
3   True  False  False
4  False  False  False

If we need ANY match in each row, use the .any method along axis=1 :

In [43]: (df == 'banana').any(axis=1)
Out[43]: 
1     True
2    False
3     True
4    False
dtype: bool

To select corresponding rows :

In [44]: df[(df == 'banana').any(axis=1)]
Out[44]: 
        A       B     C
1   apple  banana  pear
3  banana    pear  pear
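Putting those steps together, here's a minimal self-contained sketch (reconstructing the example frame from the question):

```python
import pandas as pd

# Reconstruction of the example frame from the question
df = pd.DataFrame(
    {"A": ["apple", "pear", "banana", "apple"],
     "B": ["banana", "pear", "pear", "apple"],
     "C": ["pear", "apple", "pear", "pear"]},
    index=[1, 2, 3, 4],
)

mask = (df == "banana").any(axis=1)  # one boolean per row
result = df[mask]                    # keeps rows 1 and 3
print(result)
```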

II. Match multiple strings

1. Search for ANY match

NumPy's np.isin would work here (or use DataFrame.isin as shown in other answers) to get all matches from the list of search strings in df. So, say we are looking for 'pear' or 'apple' in df :

In [51]: np.isin(df, ['pear','apple'])
Out[51]: 
array([[ True, False,  True],
       [ True,  True,  True],
       [False,  True,  True],
       [ True,  True,  True]])

# ANY match along each row
In [52]: np.isin(df, ['pear','apple']).any(axis=1)
Out[52]: array([ True,  True,  True,  True])

# Select corresponding rows with masking
In [56]: df[np.isin(df, ['pear','apple']).any(axis=1)]
Out[56]: 
        A       B      C
1   apple  banana   pear
2    pear    pear  apple
3  banana    pear   pear
4   apple   apple   pear
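As a self-contained sketch of this ANY-match selection (same reconstructed frame as above):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example frame from the question
df = pd.DataFrame(
    {"A": ["apple", "pear", "banana", "apple"],
     "B": ["banana", "pear", "pear", "apple"],
     "C": ["pear", "apple", "pear", "pear"]},
    index=[1, 2, 3, 4],
)

# 2D membership mask, then reduce along columns: ANY match per row
mask = np.isin(df.to_numpy(), ["pear", "apple"]).any(axis=1)
result = df[mask]  # every row contains 'pear' or 'apple'
print(result)
```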

2. Search for ALL matches

So, now we are looking for rows that contain BOTH of, say, ['pear','apple']. We will make use of NumPy broadcasting via np.equal.outer :

In [66]: np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1)
Out[66]: 
array([[ True,  True],
       [ True,  True],
       [ True, False],
       [ True,  True]])

We have a search list of 2 items, so we get a 2D mask with len(df) rows and one column per search item. Thus, in the above result, the first column is for 'pear' and the second for 'apple'.

To make things concrete, let's get a mask for three items ['apple','banana', 'pear'] :

In [62]: np.equal.outer(df.to_numpy(copy=False),  ['apple','banana', 'pear']).any(axis=1)
Out[62]: 
array([[ True,  True,  True],
       [ True, False,  True],
       [False,  True,  True],
       [ True, False,  True]])

The columns of this mask are for 'apple','banana', 'pear' respectively.

Back to the two-item case: since we are looking for ALL matches in each row, reduce that 2D mask with .all along its columns :

In [67]: np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1).all(axis=1)
Out[67]: array([ True,  True, False,  True])

Finally, select rows :

In [70]: df[np.equal.outer(df.to_numpy(copy=False),  ['pear','apple']).any(axis=1).all(axis=1)]
Out[70]: 
       A       B      C
1  apple  banana   pear
2   pear    pear  apple
4  apple   apple   pear
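The whole ALL-match pipeline as one runnable sketch (frame reconstructed from the question):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example frame from the question
df = pd.DataFrame(
    {"A": ["apple", "pear", "banana", "apple"],
     "B": ["banana", "pear", "pear", "apple"],
     "C": ["pear", "apple", "pear", "pear"]},
    index=[1, 2, 3, 4],
)

search = ["pear", "apple"]
# (rows, cols, items) comparison, reduced to (rows, items) ...
hits = np.equal.outer(df.to_numpy(copy=False), search).any(axis=1)
mask = hits.all(axis=1)  # ... then to (rows,): every item present
result = df[mask]        # rows 1, 2 and 4
print(result)
```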

1 Comment

Actually this is easier to use when searching for multiple strings

For single search value

df[df.values  == "banana"]

or

df[df.isin(['banana'])]

For multiple search terms:

df[(df.values == "banana") | (df.values == "apple")]

or

df[df.isin(['banana', "apple"])]

  #         A       B      C
  #  1   apple  banana    NaN
  #  2     NaN     NaN  apple
  #  3  banana     NaN    NaN
  #  4   apple   apple    NaN
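Note that df[df.isin([...])] returns the NaN-masked frame shown above, not whole rows. To select whole rows once each (avoiding the duplicate-index issue raised in the question's comments), reduce the mask with .any first; a sketch using a reconstruction of the question's frame:

```python
import pandas as pd

# Reconstruction of the example frame from the question
df = pd.DataFrame(
    {"A": ["apple", "pear", "banana", "apple"],
     "B": ["banana", "pear", "pear", "apple"],
     "C": ["pear", "apple", "pear", "pear"]},
    index=[1, 2, 3, 4],
)

mask = df.isin(["banana", "apple"]).any(axis=1)
result = df[mask]  # whole rows, each appearing exactly once
print(result)
```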

Using select_rows from Divakar's answer, rows containing both values are returned:

select_rows(df,['apple','banana'])

 #         A       B     C
 #   0  apple  banana  pear


If you want all the rows of df that contain any of the values in values, use:

df[df.isin(values).any(axis=1)]

Example:

In [2]: df                                                                                                                       
Out[2]: 
   0  1  2
0  7  4  9
1  8  2  7
2  1  9  7
3  3  8  5
4  5  1  1

In [3]: df[df.isin({1, 9, 123}).any(axis=1)]
Out[3]: 
   0  1  2
0  7  4  9
2  1  9  7
4  5  1  1
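The same pattern as a standalone script (numeric frame copied from the transcript above):

```python
import pandas as pd

df = pd.DataFrame(
    [[7, 4, 9], [8, 2, 7], [1, 9, 7], [3, 8, 5], [5, 1, 1]]
)

# Keep rows where any cell is in the value set
result = df[df.isin({1, 9, 123}).any(axis=1)]
print(result)
```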


You can create a boolean mask by comparing the entire df against your string, then call dropna passing how='all' to drop the rows where the string appears in no column (i.e. rows that are all NaN after masking):

In [59]:
df[df == 'banana'].dropna(how='all')

Out[59]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

To test for multiple values you can use multiple masks:

In [90]:
banana = df[(df=='banana')].dropna(how='all')
banana

Out[90]:
        A       B    C
1     NaN  banana  NaN
3  banana     NaN  NaN

In [91]:    
apple = df[(df=='apple')].dropna(how='all')
apple

Out[91]:
       A      B      C
1  apple    NaN    NaN
2    NaN    NaN  apple
4  apple  apple    NaN

You can use index.intersection to index just the common index values:

In [93]:
df.loc[apple.index.intersection(banana.index)]

Out[93]:
       A       B     C
1  apple  banana  pear
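End to end, the two-mask-plus-intersection approach might look like this (frame reconstructed from the question):

```python
import pandas as pd

# Reconstruction of the example frame from the question
df = pd.DataFrame(
    {"A": ["apple", "pear", "banana", "apple"],
     "B": ["banana", "pear", "pear", "apple"],
     "C": ["pear", "apple", "pear", "pear"]},
    index=[1, 2, 3, 4],
)

banana = df[df == "banana"].dropna(how="all")  # rows containing 'banana'
apple = df[df == "apple"].dropna(how="all")    # rows containing 'apple'
both = df.loc[apple.index.intersection(banana.index)]  # rows with both
print(both)
```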

11 Comments

Thank you. This certainly works if I'm looking for one string. What if I want to select rows that contain both 'banana' and 'apple'?
I don't know pandas, but maybe something like that : df[df == 'banana', 'apple'].dropna(how='all')?
@Andromedae93 That gives me a TypeError
@mcglashan I never used pandas, but isin function should work. Documentation : pandas.pydata.org/pandas-docs/stable/generated/…
@JoeR pure numpy methods will always be faster but the pandas methods have better type and missing data handling, for this toy example and where the dtype is homogenous then a pure np method is superior
