Pandas: improve algorithm with find substring in column

Question

I have a dataframe and I want to keep only the rows where a certain column contains any of a set of substrings.

I use:

import pandas as pd

df_res = pd.DataFrame()
for i in substr:
    res = df[df['event_address'].str.contains(i)]
    df_res = pd.concat([df_res, res])  # accumulate matches for every substring

df looks like:

member_id,event_address,event_time,event_duration
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/albums,2015-05-01 00:00:05,8
g1497o1ofm5a1963,9829192.ru/apple-iphone.html,2015-05-01 00:00:15,2
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/album/165150?&p=3,2015-05-01 00:00:17,2
g1497o1ofm5a1963,fotki.yandex.ru/tags/%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=utpaladev&&p=2,2015-05-01 00:01:31,10
g1497o1ofm5a1963,3gmaster.net,2015-05-01 00:01:41,6
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&&p=2,2015-05-01 00:02:01,6
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=Sunny-Fanny&,2015-05-01 00:02:31,2
g1497o1ofm5a1963,fotki.9829192.ru/apple-iphone.html,2015-05-01 00:03:25,6

and substr is:

123.ru/gadgets/communicators
320-8080.ru/mobilephones
3gmaster.net
3-q.ru/products/smartfony/s
9829192.ru/apple-iphone.html
9829192.ru/index.php?cat=1
acer.com/ac/ru/ru/content/group/smartphones
aj.ru

I get the desired result with this code, but it takes too long. I also tried passing the column directly (substr is built as substr = urls.url.values.tolist()):

res = df[df['event_address'].str.contains(urls.url)]

but it returns:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Is there any way to make it faster, or am I doing something wrong?

Which type is substr? Is that a list of strings?

– albert, Commented Oct 4, 2016 at 8:56

jezrael · Accepted Answer · 2016-10-04 10:13:45Z

1

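I think you need to join the substrings with | and pass the result to str.contains if you need a faster solution (str.contains treats its pattern as a regular expression by default, so | means "or"):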

res = df[df['event_address'].str.contains('|'.join(urls.url))]
print (res)
          member_id                       event_address           event_time  \
1  g1497o1ofm5a1963        9829192.ru/apple-iphone.html  2015-05-01 00:00:15   
4  g1497o1ofm5a1963                        3gmaster.net  2015-05-01 00:01:41   
7  g1497o1ofm5a1963  fotki.9829192.ru/apple-iphone.html  2015-05-01 00:03:25   

   event_duration  
1               2  
4               6  
7               6

Another solution, using a list comprehension:

res = df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()]))]
print (res)
          member_id                       event_address           event_time  \
1  g1497o1ofm5a1963        9829192.ru/apple-iphone.html  2015-05-01 00:00:15   
4  g1497o1ofm5a1963                        3gmaster.net  2015-05-01 00:01:41   
7  g1497o1ofm5a1963  fotki.9829192.ru/apple-iphone.html  2015-05-01 00:03:25   

   event_duration  
1               2  
4               6  
7               6

Timings:

#[8000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)

In [68]: %timeit (df[df['event_address'].str.contains('|'.join(urls.url))])
100 loops, best of 3: 12 ms per loop

In [69]: %timeit (df.ix[df.event_address.map(check_exists)])
10 loops, best of 3: 155 ms per loop

In [70]: %timeit (df.ix[df.event_address.map(lambda x: any([True for i in urls.url.tolist() if i in x]))])
10 loops, best of 3: 163 ms per loop

In [71]: %timeit (df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()] ))])
10 loops, best of 3: 174 ms per loop
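The comparison above can be re-run outside IPython with the standard timeit module. This is only a sketch under stated assumptions: the small df and substr below are hypothetical stand-ins for the question's data, with_regex and with_map are illustrative names, and the absolute numbers will differ from machine to machine.

```python
import re
import timeit

import pandas as pd

# Hypothetical stand-ins for the question's df and substring list
addresses = [
    "fotki.yandex.ru/users/atanusha/albums",
    "9829192.ru/apple-iphone.html",
    "3gmaster.net",
    "fotki.9829192.ru/apple-iphone.html",
]
df = pd.DataFrame({"event_address": addresses * 1000})
substr = ["3gmaster.net", "9829192.ru/apple-iphone.html", "aj.ru"]

# Build the alternation once; re.escape keeps characters such as "." literal
pattern = "|".join(map(re.escape, substr))


def with_regex():
    # Single vectorised regex search over the whole column
    return df["event_address"].str.contains(pattern)


def with_map():
    # Python-level loop over the substrings for every row
    return df["event_address"].map(lambda x: any(s in x for s in substr))


# Both approaches should flag the same rows
assert with_regex().tolist() == with_map().tolist()

for fn in (with_regex, with_map):
    seconds = timeit.timeit(fn, number=10) / 10
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Checking that both masks agree before timing them guards against comparing two approaches that do not actually select the same rows.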

edited Oct 4, 2016 at 10:13

answered Oct 4, 2016 at 8:52


1 Comment

Petr Petrov Over a year ago

I tried df['event_address'].str.contains('|'.join(urls.url)), since I need regex=True, but it raises sre_constants.error: multiple repeat
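A "multiple repeat" error suggests that at least one of the joined substrings contains characters that the regex engine reads as quantifiers (such as *, + or ?). One possible way around it, assuming every entry in urls.url is meant as a literal substring, is to escape each entry with re.escape before joining. The data below is a hypothetical sample modelled on the question, and the last URL is invented purely to trigger the problem.

```python
import re

import pandas as pd

# Hypothetical sample data modelled on the question
df = pd.DataFrame({"event_address": [
    "fotki.yandex.ru/users/atanusha/albums",
    "9829192.ru/apple-iphone.html",
    "3gmaster.net",
    "fotki.9829192.ru/apple-iphone.html",
]})
urls = pd.DataFrame({"url": [
    "3gmaster.net",
    "9829192.ru/apple-iphone.html",
    "9829192.ru/index.php?cat=1",
    "example.ru/search?q=**",  # invented entry; "**" is invalid as a raw regex
]})

# Escape every substring so ".", "?" and "*" are matched literally,
# then join the escaped pieces into one alternation pattern
pattern = "|".join(re.escape(u) for u in urls["url"])

mask = df["event_address"].str.contains(pattern, regex=True)
res = df[mask]
print(res)
```

Escaping also stops "." from matching an arbitrary character, so an entry such as 3gmaster.net no longer matches an address like 3gmasterXnet.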

Howardyan · Accepted Answer · 2016-10-04 09:52:44Z

1

You can do it like this:

def check_exists(x):
    for i in substr:
        if i in x:
            return True
    return False

df2 = df.loc[df.event_address.map(check_exists)]

Or, if you prefer to write it as a one-liner:

df.loc[df.event_address.map(lambda x: any(i in x for i in substr))]
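If the substrings are meant literally, a related option is to keep str.contains but pass regex=False and combine one mask per substring. This is a sketch rather than a tuned solution: the sample df and substr are hypothetical, it assumes substr is non-empty, and it has not been benchmarked against the approaches above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data modelled on the question
df = pd.DataFrame({"event_address": [
    "fotki.yandex.ru/users/atanusha/albums",
    "9829192.ru/apple-iphone.html",
    "3gmaster.net",
    "fotki.9829192.ru/apple-iphone.html",
]})
substr = [
    "3gmaster.net",
    "9829192.ru/index.php?cat=1",
    "9829192.ru/apple-iphone.html",
]

# One boolean array per substring; regex=False treats "?" and "." literally
masks = [
    df["event_address"].str.contains(s, regex=False).to_numpy()
    for s in substr
]

# Element-wise OR across all masks (assumes substr is non-empty)
mask = np.logical_or.reduce(masks)
res = df[mask]
print(res)
```

Unlike concatenating per-substring results, a single combined mask returns each matching row once, even when it contains several of the substrings.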

edited Oct 4, 2016 at 9:52

answered Oct 4, 2016 at 9:07

