
I have a dataframe df_input1 with 10M rows. One of its columns is "geolocations". For every record, I have to find the state name from the geolocation and fill a column "state" in another dataframe, df_final. For that I have created a function convert_to_state and use it like below:

 df_final['State']  = df_input1['geolocations'].apply(convert_to_state)

Is there a faster way to achieve this, since it is taking a lot of time?

Sample data (df_input1):

vehicle-no start                end                   geolocations
123        10/12/2019 09:00:12  10/12/2019 11:00:78   fghdrf3245@bafd
456        12/10/2019 06:09:12  10/10/2019 09:23:12   {098ddc76yhfbdb7877]

The custom function:

import reverse_geocoder as rg
import polyline

def convert_to_state(geoloc):
    # Take the first (lat, lon) point of the decoded polyline
    long_lat = polyline.decode(geoloc)[0]
    # Reverse-geocode that point; "admin1" holds the state/province name
    state_name = rg.search(long_lat)[0]["admin1"]
    return state_name

  • Can you provide 1) sample data and 2) the convert_to_state function? Optimization will depend on the kind of operation you are doing. Also, you can explore dask and swifter. Commented Mar 26, 2020 at 10:38
  • @AnantGupta I have edited the question with the sample dataset and the custom function. Commented Mar 26, 2020 at 10:49

2 Answers


I would suggest using numpy to make a vectorized function:

import numpy as np
import pandas as pd
import reverse_geocoder as rg
import polyline

def convert_to_state(geoloc):
    long_lat = polyline.decode(geoloc)[0]
    state_name = rg.search(long_lat)[0]["admin1"]
    return state_name


convert_to_state_vec = np.vectorize(convert_to_state)  # vectorize the function

col = df_input1['geolocations'].values  # a numpy array of the column
df_final['State'] = pd.Series(convert_to_state_vec(col))

Running the vectorized function on the underlying numpy array gives a speed boost; you then convert the result back to a pandas Series. (Note that np.vectorize is essentially a for loop under the hood, per the numpy docs, so the gain comes from shedding pandas overhead rather than true vectorization.)

I would highly recommend timing both this method and the plain .apply method with the %timeit magic in IPython, and comparing the runtimes on smaller subsets.

Here is a trivial example:

In [1]: import pandas as pd                               

In [2]: import numpy as np                                

In [3]: x = pd.DataFrame( 
   ...:     [ 
   ...:         [1,2,"Some.Text"], 
   ...:         [3,4,"More.Text"] 
   ...:     ], 
   ...:     columns = ["A","B", "C"] 
   ...: )                                                 

In [4]: x                                                 
Out[4]: 
   A  B          C
0  1  2  Some.Text
1  3  4  More.Text

In [5]: def foo_split(t): 
   ...:     return t.split(".")[0] 
   ...:                                                   

In [6]: %timeit y = x.C.apply(foo_split)                  
248 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: c = x.C.values # numpy array of the column        

In [8]: foo_split_vect = np.vectorize(foo_split)          

In [9]: %timeit z = pd.Series(foo_split_vect(c))          
159 µs ± 624 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You can see that the speed improves by roughly 1.5x in this case.
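A further option worth timing, not benchmarked here, so treat it as a suggestion to verify: reverse_geocoder also has a batch mode in which rg.search accepts a list of coordinate tuples. Decoding every polyline first and issuing a single batched search call may beat any per-row scheme. A minimal sketch under that assumption:

import pandas as pd
import reverse_geocoder as rg
import polyline

def convert_all_to_states(geolocs):
    # Decode every polyline up front, keeping the first (lat, lon) point of each
    coords = [polyline.decode(g)[0] for g in geolocs]
    # One batched reverse-geocode call instead of millions of single lookups
    results = rg.search(coords)
    return [r["admin1"] for r in results]

df_final['State'] = convert_all_to_states(df_input1['geolocations'].tolist())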


Since the function is purely functional in nature (each row's processing is independent of every other row), we can leverage parallel processing to make it run faster; see also the multiprocessing sketch after the swifter snippet below.

You can use the following. First, install swifter from the command prompt:

pip install swifter

import swifter
df_final['State']  = df_input1['geolocations'].swifter.apply(convert_to_state)
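
As an alternative without the extra dependency, here is a minimal sketch of the same idea using the standard-library multiprocessing module. The chunksize of 10_000 is an arbitrary assumption to amortize inter-process overhead; tune it for your data, and df_input1/df_final are assumed to exist as in the question:

from multiprocessing import Pool

import reverse_geocoder as rg
import polyline

def convert_to_state(geoloc):
    long_lat = polyline.decode(geoloc)[0]
    return rg.search(long_lat)[0]["admin1"]

if __name__ == "__main__":
    # Each worker process loads the geocoder data on its first search call
    with Pool() as pool:  # defaults to one worker per CPU core
        states = pool.map(convert_to_state, df_input1['geolocations'], chunksize=10_000)
    df_final['State'] = states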
