Multiple column pandas vectorized string function?

Question

Is there a way of querying a DataFrame for rows that contain a certain string in any column? Something like Series.str except for a DataFrame? Here's what I have so far:

In [2]: s = "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est"

In [3]: df = pd.DataFrame(np.array(s.split(' ')).reshape((-1, 4)), columns=['one', 'two', 'three', 'four'])

In [4]: df
Out[4]: 
           one            two         three        four
0        Lorem          ipsum         dolor         sit
1        amet,    consectetur   adipisicing       elit,
2          sed             do       eiusmod      tempor
3   incididunt             ut        labore          et
4       dolore          magna       aliqua.          Ut
5         enim             ad         minim     veniam,
6         quis        nostrud  exercitation     ullamco
7      laboris           nisi            ut     aliquip
8           ex             ea       commodo  consequat.
9         Duis           aute         irure       dolor
10          in  reprehenderit            in   voluptate
11       velit           esse        cillum      dolore
12          eu         fugiat         nulla   pariatur.
13   Excepteur           sint      occaecat   cupidatat
14         non      proident,          sunt          in
15       culpa            qui       officia    deserunt
16      mollit           anim            id         est

[17 rows x 4 columns]

In [5]: mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')

In [6]: df[mask]
Out[6]: 
       one    two    three    four
0    Lorem  ipsum    dolor     sit
4   dolore  magna  aliqua.      Ut
9     Duis   aute    irure   dolor
11   velit   esse   cillum  dolore

[4 rows x 4 columns]

Ideally, I would like to replace the last two lines with something similar to this:

df[df.loc[:, 'one':'four'].str.contains('dolor')]

Is this possible?

Saullo G. P. Castro · Accepted Answer · 2014-06-27 13:38:53Z

2

You can use the vectorized operations of np.char.array():

import numpy as np
a = np.char.array(df.values)
mask = a.find('dolor') != -1
df2 = df.iloc[np.any(mask, axis=1)]

and the content of df2 will be:

       one    two    three    four
0    Lorem  ipsum    dolor     sit
4   dolore  magna  aliqua.      Ut
9     Duis   aute    irure   dolor
11   velit   esse   cillum  dolore
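For reference, here is a self-contained sketch of the same idea using the function form np.char.find on a small made-up frame. The astype(str) conversion is an assumption added here so the array has a NumPy string dtype; it also turns a missing value into the literal text "nan", so it is not a real fix for missing data.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not the one from the question)
df = pd.DataFrame({
    "one": ["Lorem", "amet,", "dolore"],
    "two": ["ipsum", "consectetur", "magna"],
    "three": ["dolor", "adipisicing", "aliqua."],
})

# Convert every cell to a NumPy unicode string; NaN would become "nan" here
cells = df.to_numpy().astype(str)

# np.char.find returns the first match position per cell, or -1 if absent
hits = np.char.find(cells, "dolor") != -1

# Keep the rows where at least one column matched
row_mask = hits.any(axis=1)
df2 = df[row_mask]
```

Because row_mask is a plain boolean array, it can be passed to either df[...] or df.iloc[...].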

edited Jun 27, 2014 at 13:38

answered Jun 27, 2014 at 13:17

Saullo G. P. Castro


2 Comments

yemu Over a year ago

This is definitely the fastest solution: 1000 loops, best of 3: 358 µs per loop

Jeff Over a year ago

Note that this will choke on NaNs (this is the reason pandas exposes the str ops, so that they work with missing values).
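To illustrate the point about missing values, here is a minimal sketch that stays within the pandas str methods. The frame is made up for the example, and passing na=False is one possible choice: it treats a missing cell as "no match" rather than propagating NaN into the mask.

```python
import numpy as np
import pandas as pd

# Illustrative frame with missing values (not the one from the question)
df = pd.DataFrame({
    "one": ["Lorem", "dolore", np.nan],
    "two": ["ipsum", np.nan, "aute"],
})

# Apply Series.str.contains to each column; na=False maps NaN cells to False
cell_hits = df.apply(lambda col: col.str.contains("dolor", na=False))

# A row is kept if any of its cells matched
mask = cell_hits.any(axis=1)
result = df[mask]
```

Here the apply runs once per column rather than once per row, so each call is a single vectorized string operation.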

unutbu · Accepted Answer · 2014-06-27 12:41:07Z

Pandas does not have DataFrame.str methods (at least not yet). However, you could use

import numpy as np
mask = np.logical_or.reduce(
    [df[col].str.contains('dolor') 
     for col in df.loc[:, 'one':'four'].columns])

This is a little less writing, and a bit quicker than

mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor')

In [29]: %timeit mask = np.logical_or.reduce([df[col].str.contains('dolor') for col in df.loc[:, 'one':'four'].columns]); df[mask]
1000 loops, best of 3: 761 µs per loop

In [30]: %timeit mask = df['one'].str.contains('dolor') | df['two'].str.contains('dolor') | df['three'].str.contains('dolor') | df['four'].str.contains('dolor'); df[mask]
1000 loops, best of 3: 1.13 ms per loop
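The reduce pattern above can be wrapped in a small helper. This is only a sketch: the function name rows_containing and its parameters are hypothetical, it assumes every selected column holds strings, and na=False is an added choice so that missing values count as non-matches.

```python
import numpy as np
import pandas as pd


def rows_containing(df, pattern, columns=None, regex=True):
    """Return a boolean Series: True where any selected column contains pattern.

    Hypothetical helper, not part of pandas.
    """
    cols = df.columns if columns is None else columns
    # One vectorized str.contains call per column
    masks = [df[col].str.contains(pattern, na=False, regex=regex) for col in cols]
    # OR the per-column masks together, element by element
    combined = np.logical_or.reduce(masks)
    return pd.Series(combined, index=df.index)


# Small illustrative frame (not the one from the question)
df = pd.DataFrame({
    "one": ["Lorem", "amet,", "dolore"],
    "two": ["ipsum", "consectetur", "magna"],
    "three": ["dolor", "adipisicing", "aliqua."],
})

mask_all = rows_containing(df, "dolor")
mask_two = rows_containing(df, "dolor", columns=["two"])
matches = df[mask_all]
```

Returning a Series aligned to df.index, rather than a bare array, keeps the mask usable with df[...] and df.loc[...] when the frame has a non-default index.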

yemu · Accepted Answer · 2014-06-27 13:13:12Z

0

This tells you whether 'dolor' appears in any of the columns:

df.loc[:, 'one':'four'].apply(lambda x: x.str.contains('dolor'), axis=1)

It returns a True/False value for each cell of the selected columns.

If you combine this with another apply, you get a single True/False value per row:

df.loc[:, 'one':'four'].apply(lambda x: x.str.contains('dolor'), axis=1).apply(lambda x: True in x.values, axis=1)

Using this as the row mask gives your result:

df[df.loc[:, 'one':'four'].apply(lambda x: x.str.contains('dolor'), axis=1).apply(lambda x: True in x.values, axis=1)]

However, this is about 3-4 times slower than unutbu's solution.

edited Jun 27, 2014 at 13:13

answered Jun 27, 2014 at 12:36

yemu

