
I have a DataFrame and I want to keep only the rows where a certain column contains any of a list of substrings.

I use:

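matches = []
for i in substr:
    # collect the rows matching each substring instead of overwriting the result
    matches.append(df[df['event_address'].str.contains(i)])
df_res = pd.concat(matches)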

df looks like:

member_id,event_address,event_time,event_duration
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/albums,2015-05-01 00:00:05,8
g1497o1ofm5a1963,9829192.ru/apple-iphone.html,2015-05-01 00:00:15,2
g1497o1ofm5a1963,fotki.yandex.ru/users/atanusha/album/165150?&p=3,2015-05-01 00:00:17,2
g1497o1ofm5a1963,fotki.yandex.ru/tags/%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=utpaladev&&p=2,2015-05-01 00:01:31,10
g1497o1ofm5a1963,3gmaster.net,2015-05-01 00:01:41,6
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&&p=2,2015-05-01 00:02:01,6
g1497o1ofm5a1963,fotki.yandex.ru/search.xml?text=%D0%B1%D0%BE%D1%81%D0%B8%D0%BA%D0%BE%D0%BC&search_author=Sunny-Fanny&,2015-05-01 00:02:31,2
g1497o1ofm5a1963,fotki.9829192.ru/apple-iphone.html,2015-05-01 00:03:25,6

and substr is:

123.ru/gadgets/communicators
320-8080.ru/mobilephones
3gmaster.net
3-q.ru/products/smartfony/s
9829192.ru/apple-iphone.html
9829192.ru/index.php?cat=1
acer.com/ac/ru/ru/content/group/smartphones
aj.ru

I get the desired result with this code, but it takes too long. I also tried passing the column directly (substr is built as substr = urls.url.values.tolist()):

res = df[df['event_address'].str.contains(urls.url)]

but it returns:

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Is there any way to make this faster, or am I doing something wrong?

  • What type is substr? Is it a list of strings? Commented Oct 4, 2016 at 8:56

2 Answers


I think you need to join the substrings with | and pass the result to str.contains if you need a faster solution:

res = df[df['event_address'].str.contains('|'.join(urls.url))]
print (res)
          member_id                       event_address           event_time  \
1  g1497o1ofm5a1963        9829192.ru/apple-iphone.html  2015-05-01 00:00:15   
4  g1497o1ofm5a1963                        3gmaster.net  2015-05-01 00:01:41   
7  g1497o1ofm5a1963  fotki.9829192.ru/apple-iphone.html  2015-05-01 00:03:25   

   event_duration  
1               2  
4               6  
7               6  

Another list comprehension solution:

res = df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()]))]
print (res)
          member_id                       event_address           event_time  \
1  g1497o1ofm5a1963        9829192.ru/apple-iphone.html  2015-05-01 00:00:15   
4  g1497o1ofm5a1963                        3gmaster.net  2015-05-01 00:01:41   
7  g1497o1ofm5a1963  fotki.9829192.ru/apple-iphone.html  2015-05-01 00:03:25   

   event_duration  
1               2  
4               6  
7               6  

Timings:

#[8000 rows x 4 columns]
df = pd.concat([df]*1000).reset_index(drop=True)

In [68]: %timeit (df[df['event_address'].str.contains('|'.join(urls.url))])
100 loops, best of 3: 12 ms per loop

In [69]: %timeit (df.ix[df.event_address.map(check_exists)])
10 loops, best of 3: 155 ms per loop

In [70]: %timeit (df.ix[df.event_address.map(lambda x: any([True for i in urls.url.tolist() if i in x]))])
10 loops, best of 3: 163 ms per loop

In [71]: %timeit (df[df['event_address'].apply(lambda x: any([n in x for n in urls.url.tolist()] ))])
10 loops, best of 3: 174 ms per loop
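
If you want to reproduce this kind of comparison outside IPython, here is a rough sketch using the standard timeit module. The four addresses and two URLs are a shortened subset of the question's data, and the function names with_regex and with_apply are made up for illustration. Absolute numbers will differ by machine and pandas version.

```python
import timeit
import pandas as pd

# Shortened sample taken from the question, repeated to 8000 rows
addresses = [
    "fotki.yandex.ru/users/atanusha/albums",
    "9829192.ru/apple-iphone.html",
    "3gmaster.net",
    "fotki.9829192.ru/apple-iphone.html",
]
df = pd.DataFrame({"event_address": addresses * 2000})
urls = pd.DataFrame({"url": ["3gmaster.net", "9829192.ru/apple-iphone.html"]})

def with_regex():
    # one vectorized regex search over the whole column
    return df[df["event_address"].str.contains("|".join(urls.url))]

def with_apply():
    # Python-level substring test for every row
    needles = urls.url.tolist()
    return df[df["event_address"].apply(lambda x: any(n in x for n in needles))]

for fn in (with_regex, with_apply):
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best * 1e3:.1f} ms per call")
```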

1 Comment

I tried df['event_address'].str.contains('|'.join(urls.url)) because I need regex=True, but it raises sre_constants.error: multiple repeat
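
A likely cause of errors like this is that the URLs contain regex metacharacters such as ?, + or *, which the joined pattern treats as operators rather than literal text. One possible fix is to escape each substring with re.escape before joining. The sketch below uses a small made-up frame; the address ending in &page=2 is hypothetical, added only to show the effect.

```python
import re
import pandas as pd

df = pd.DataFrame({"event_address": [
    "fotki.yandex.ru/users/atanusha/albums",
    "9829192.ru/apple-iphone.html",
    "3gmaster.net",
    "9829192.ru/index.php?cat=1&page=2",  # hypothetical address for illustration
]})
substr = ["3gmaster.net", "9829192.ru/index.php?cat=1"]

# Unescaped: "?" makes the preceding "p" optional, so the literal "?" never matches
raw_pattern = "|".join(substr)
raw_res = df[df["event_address"].str.contains(raw_pattern)]

# Escaped: every substring is matched literally
safe_pattern = "|".join(re.escape(s) for s in substr)
safe_res = df[df["event_address"].str.contains(safe_pattern)]

print(raw_res.index.tolist())
print(safe_res.index.tolist())
```

Escaping also stops . from matching any character, so the regex version then behaves like a plain substring test.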

Do it like this:

def check_exists(x):
    for i in substr:
        if i in x:
            return True
    return False

df2 = df.loc[df.event_address.map(check_exists)]

Or, if you prefer to write it as a one-liner:

df.loc[df.event_address.map(lambda x: any(i in x for i in substr))]
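
For reference, a self-contained sketch of the same idea with plain boolean-mask indexing. The frame is a shortened copy of the question's data, and contains_any is a helper name invented here. Passing a generator expression to any() lets it stop at the first matching substring instead of building a full list for every row.

```python
import pandas as pd

# Shortened copy of the question's data
df = pd.DataFrame({
    "member_id": ["g1497o1ofm5a1963"] * 4,
    "event_address": [
        "fotki.yandex.ru/users/atanusha/albums",
        "9829192.ru/apple-iphone.html",
        "3gmaster.net",
        "fotki.9829192.ru/apple-iphone.html",
    ],
})
substr = ["3gmaster.net", "9829192.ru/apple-iphone.html", "aj.ru"]

def contains_any(address, needles=tuple(substr)):
    # True as soon as one substring is found in the address
    return any(n in address for n in needles)

mask = df["event_address"].map(contains_any)
df2 = df[mask]
print(df2)
```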
