
I have a dataframe df_input1 with 10M rows. One of its columns is "geolocations". For every record, I have to find the state name from the geolocation and fill a column "state" in another dataframe, df_final. For that I have created a function convert_to_state and use it like below:

 df_final['State']  = df_input1['geolocations'].apply(convert_to_state)

Is there a faster way to achieve this, since it is taking a lot of time?

Sample data (df_input1):

vehicle-no start                end                   geolocations
123        10/12/2019 09:00:12  10/12/2019 11:00:78   fghdrf3245@bafd
456        12/10/2019 06:09:12  10/10/2019 09:23:12   {098ddc76yhfbdb7877]

The custom function:

import reverse_geocoder as rg
import polyline

def convert_to_state(geoloc):
    # Take the first (lat, lon) point of the decoded polyline
    long_lat = polyline.decode(geoloc)[0]
    # Reverse-geocode that point; "admin1" holds the state/province name
    state_name = rg.search(long_lat)[0]["admin1"]
    return state_name

  • Can you provide 1) sample data and 2) the convert_to_state function? Optimization will depend on the kind of operation you are doing. Also, you can explore dask and swifter. Commented Mar 26, 2020 at 10:38
  • @AnantGupta I have edited the question with the sample dataset and the custom function. Commented Mar 26, 2020 at 10:49

2 Answers


I would suggest using numpy to make a vectorized function:

import numpy as np
import pandas as pd
import reverse_geocoder as rg
import polyline

def convert_to_state(geoloc):
    long_lat = polyline.decode(geoloc)[0]
    state_name = rg.search(long_lat)[0]["admin1"]
    return state_name


convert_to_state_vec = np.vectorize(convert_to_state)  # vectorize the function

col = df_input1['geolocations'].values  # a numpy array of the column
df_final['State'] = pd.Series(convert_to_state_vec(col))

Running the vectorized function on the underlying numpy array gives a speed boost; you then convert the result back to a pandas Series. (Note that np.vectorize is essentially a for loop under the hood, per the numpy docs, so the gain comes from shedding pandas overhead rather than true vectorization.)

I would highly recommend timing both this method and the plain .apply method with the %timeit magic in IPython, and comparing the runtimes on smaller subsets.

Here is a trivial example:

In [1]: import pandas as pd                               

In [2]: import numpy as np                                

In [3]: x = pd.DataFrame( 
   ...:     [ 
   ...:         [1,2,"Some.Text"], 
   ...:         [3,4,"More.Text"] 
   ...:     ], 
   ...:     columns = ["A","B", "C"] 
   ...: )                                                 

In [4]: x                                                 
Out[4]: 
   A  B          C
0  1  2  Some.Text
1  3  4  More.Text

In [5]: def foo_split(t): 
   ...:     return t.split(".")[0] 
   ...:                                                   

In [6]: %timeit y = x.C.apply(foo_split)                  
248 µs ± 4.09 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [7]: c = x.C.values # numpy array of the column        

In [8]: foo_split_vect = np.vectorize(foo_split)          

In [9]: %timeit z = pd.Series(foo_split_vect(c))          
159 µs ± 624 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

You can see that the speed improves by roughly 1.5x in this case.
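A further option worth timing, not benchmarked here, so treat it as a suggestion to verify: reverse_geocoder also has a batch mode in which rg.search accepts a list of coordinate tuples. Decoding every polyline first and issuing a single batched search call may beat any per-row scheme. A minimal sketch under that assumption:

import pandas as pd
import reverse_geocoder as rg
import polyline

def convert_all_to_states(geolocs):
    # Decode every polyline up front, keeping the first (lat, lon) point of each
    coords = [polyline.decode(g)[0] for g in geolocs]
    # One batched reverse-geocode call instead of millions of single lookups
    results = rg.search(coords)
    return [r["admin1"] for r in results]

df_final['State'] = convert_all_to_states(df_input1['geolocations'].tolist())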


Since the function is purely functional in nature (each row's processing is independent of every other row), we can leverage parallel processing to make it run faster; see also the multiprocessing sketch after the swifter snippet below.

You can use the following. First, install swifter from the command prompt:

pip install swifter

import swifter
df_final['State']  = df_input1['geolocations'].swifter.apply(convert_to_state)
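
As an alternative without the extra dependency, here is a minimal sketch of the same idea using the standard-library multiprocessing module. The chunksize of 10_000 is an arbitrary assumption to amortize inter-process overhead; tune it for your data, and df_input1/df_final are assumed to exist as in the question:

from multiprocessing import Pool

import reverse_geocoder as rg
import polyline

def convert_to_state(geoloc):
    long_lat = polyline.decode(geoloc)[0]
    return rg.search(long_lat)[0]["admin1"]

if __name__ == "__main__":
    # Each worker process loads the geocoder data on its first search call
    with Pool() as pool:  # defaults to one worker per CPU core
        states = pool.map(convert_to_state, df_input1['geolocations'], chunksize=10_000)
    df_final['State'] = states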
