I am hoping to find ways to speed up this code. I have a set of strings stored in a DataFrame, each of which contains several names, and I know the number of names in each string before this step, like so:

print(df)

description                      num_people        people    
'Harry ran with Sally'                2              []         
'Joe was swinging with Sally'         2              []
'Lola Dances alone'                   1              []

I am using a dictionary whose keys are the names I want to find in description, like so:

my_dict={'Harry':'1283','Joe':'1828','Sally':'1298', 'Lola':'1982'}

and then using iterrows to search each string for matches, like so:

import re
for index, row in df.iterrows():
    # write through df.at; assigning to `row` would only change a copy
    df.at[index, 'people'] = [key for key in my_dict if re.findall(key, row.description)]

When run, it ends up with:

print(df)

description                      num_people        people    
'Harry ran with sally'                2              ['Harry','Sally']         
'Joe was swinging with sally'         2              ['Joe','Sally']
'Lola Dances alone'                   1              ['Lola']

The problem is that this code is still fairly slow, and I have a large number of descriptions and over 1000 keys. Is there a faster way to perform this operation, perhaps by using the known number of people in each description?

1 Answer

Faster solution:

# strip the surrounding ' characters, then split each description into a list of words
splited = df.description.str.strip("'").str.split()
# keep only the words that are keys of my_dict (exact, case-sensitive match)
df['people'] = splited.apply(lambda x: [i for i in x if i in my_dict])
print(df)
                     description  num_people          people
0         'Harry ran with Sally'           2  [Harry, Sally]
1  'Joe was swinging with Sally'           2    [Joe, Sally]
2            'Lola Dances alone'           1          [Lola]
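
Note that splitting on whitespace only finds exact, case-sensitive, whole-word matches, so 'sally' or 'Sally.' would be missed. If that matters, one possible variant is to tokenise with a regular expression and look each token up in a lowercase map of the keys. This is only a sketch: the helper names canonical and find_people and the sample rows are made up for illustration, and it assumes every key is a single word.

```python
import re
import pandas as pd

my_dict = {'Harry': '1283', 'Joe': '1828', 'Sally': '1298', 'Lola': '1982'}

# Illustrative data (not from the question): mixed case and trailing punctuation.
df = pd.DataFrame({
    'description': ["'Harry ran with sally.'",
                    "'Joe was swinging with Sally'",
                    "'Lola Dances alone'"],
    'num_people': [2, 2, 1],
})

# Map each lowercased key back to the spelling used in my_dict.
canonical = {key.lower(): key for key in my_dict}

def find_people(text):
    """Return the my_dict keys found in text, in order of appearance."""
    # \w+ drops quotes and punctuation; lowercasing makes the lookup case-insensitive.
    tokens = re.findall(r"\w+", text.lower())
    return [canonical[token] for token in tokens if token in canonical]

df['people'] = df['description'].map(find_people)
print(df)
```

Each token costs one dictionary lookup, so the work per row should depend on the length of the description rather than on the number of keys, which is where the regex-per-key loop spends its time.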

Timings:

#[30000 rows x 3 columns]
In [198]: %timeit (orig(my_dict, df))
1 loop, best of 3: 3.63 s per loop

In [199]: %timeit (new(my_dict, df1))
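10 loops, best of 3: 78.2 ms per loop

Setup used for the timings:

import re
import pandas as pd

# sample data reconstructed from the printed output above
df = pd.DataFrame({'description': ["'Harry ran with Sally'",
                                   "'Joe was swinging with Sally'",
                                   "'Lola Dances alone'"],
                   'num_people': [2, 2, 1]})
df['people'] = [[],[],[]]
# repeat the 3 sample rows 10000 times -> 30000 rows
df = pd.concat([df]*10000).reset_index(drop=True)
df1 = df.copy()

my_dict={'Harry':'1283','Joe':'1828','Sally':'1298', 'Lola':'1982'}

def orig(my_dict, df):
    # original approach: one regex search per key for every row
    for index, row in df.iterrows():
        df.at[index, 'people']=[key for key in my_dict if re.findall(key,row.description)]
    return df


def new(my_dict, df):
    # vectorized strip/split, then one dictionary lookup per word
    df.description = df.description.str.strip("'")
    splited = df.description.str.split()
    df.people = splited.apply(lambda x: [i for i in x if i in my_dict])
    return df


print(orig(my_dict, df))
print(new(my_dict, df1))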
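
As for using the number of people: one option is to stop scanning a description as soon as num_people names have been found. Below is a rough sketch of that idea; find_people_limited is a hypothetical helper, not something pandas provides, and it assumes num_people is accurate for every row. I have not timed it, and with descriptions this short the saving is probably small compared to the gain from dropping the per-key regex search.

```python
import re

my_dict = {'Harry': '1283', 'Joe': '1828', 'Sally': '1298', 'Lola': '1982'}

def find_people_limited(text, num_people, lookup):
    """Collect keys of lookup found in text, stopping after num_people hits."""
    found = []
    # finditer yields words lazily, so the scan really stops at the break.
    for match in re.finditer(r"\w+", text):
        word = match.group()
        if word in lookup:
            found.append(word)
            if len(found) == num_people:
                break  # all expected names found; skip the rest of the text
    return found

# Hypothetical sample rows as (description, num_people) pairs.
rows = [("'Harry ran with Sally'", 2),
        ("'Joe was swinging with Sally'", 2),
        ("'Lola Dances alone'", 1)]

people = [find_people_limited(text, n, my_dict) for text, n in rows]
print(people)
```

With a DataFrame, the same list comprehension can run over zip(df['description'], df['num_people']) and the result can be assigned to df['people'].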