5
>>> arr
array([[ 0., 10.,  0., ...,  0.,  0.,  0.],
       [ 0.,  4.,  0., ...,  6.,  0.,  9.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  2.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  3.,  0.]])

In the NumPy array above, I would like to replace every value that matches an entry in the country_codes column of the dataframe df_A with the corresponding value from the continent_codes column. df_A looks like:

   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5

Right now, I loop through the dataframe and replace values using NumPy boolean indexing. Given that iterrows() tends to be slow, is there a more direct, vectorized way to do this?

for index, row in self.df_A.iterrows():
    arr[arr == row['country_codes']] = row['continent_codes']
    Hmm, one method would be to construct a DataFrame from your array and then call map on each column: a = pd.DataFrame(arr) followed by a.apply(lambda x: x.map(df_A.set_index('country_codes')['continent_codes'])), or something like this. Commented Dec 16, 2015 at 20:26
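The suggestion in the comment above could be sketched as follows. This is a minimal sketch rather than the commenter's exact code: the small inputs are made up to mirror the question's data, and the fillna step is an added assumption so that values without a matching country code keep their original value (map on its own would turn them into NaN).

```python
import numpy as np
import pandas as pd

# Hypothetical small inputs mirroring the question's data
arr = np.array([[0, 10, 4], [8, 0, 24]])
df_A = pd.DataFrame({'country_codes': [4, 8, 12, 16, 24],
                     'continent_codes': [4, 3, 5, 6, 5]})

# Series indexed by country code, used as the lookup for map
mapping = df_A.set_index('country_codes')['continent_codes']

a = pd.DataFrame(arr)
# map yields NaN for unmatched values, so fall back to the original column
mapped = a.apply(lambda col: col.map(mapping).fillna(col))

# NaN handling promotes to float, so cast back to the input dtype
result = mapped.to_numpy().astype(arr.dtype)
print(result)
```

This goes through pandas column by column, so it is mainly a readability option and is not expected to beat a pure NumPy approach on large arrays.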

2 Answers

2

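Approach #1: A vectorized approach using np.searchsorted and np.in1d is listed below. It assumes country_codes is sorted in ascending order, as np.searchsorted requires. (NumPy recommends np.isin over np.in1d for new code.) -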

# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])

# Mask of elements to be changed
mask = np.in1d(arr,oldval)

# Indices for each match from oldval in arr
idx = np.searchsorted(oldval,arr.ravel()[mask])

# Using the mask put selective elements from continent_codes column into arr
arr.ravel()[mask] = newval[idx]

Sample run -

>>> arr   # Original 2D array
array([[23,  4, 23,  5,  8],
       [ 3,  6,  8,  5, 11],
       [16, 24, 15,  4, 10],
       [ 4, 16, 10,  8,  1]])
>>> df
   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5

>>> oldval = np.array(df['country_codes'])
>>> newval = np.array(df['continent_codes'])
>>> mask = np.in1d(arr,oldval)
>>> idx = np.searchsorted(oldval,arr.ravel()[mask])
>>> arr.ravel()[mask] = newval[idx]

>>> mask.reshape(arr.shape)  # Mask array depicting which elements were updated
array([[False,  True, False, False,  True],
       [False, False,  True, False, False],
       [ True,  True, False,  True, False],
       [ True,  True, False,  True, False]], dtype=bool)
>>> arr  # Updated 2D array
array([[23,  4, 23,  5,  3],
       [ 3,  6,  3,  5, 11],
       [ 6,  5, 15,  4, 10],
       [ 4,  6, 10,  3,  1]])
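If country_codes is not sorted, the positions returned by np.searchsorted can be wrong or equal to len(oldval), and indexing newval with them can then raise an out-of-bounds IndexError. A minimal sketch of one way to handle that case, assuming a hypothetical helper named replace_codes (not part of the original answer) that uses the sorter argument of np.searchsorted together with np.isin:

```python
import numpy as np

def replace_codes(arr, oldval, newval):
    """Return a copy of arr with every value found in oldval replaced by
    the matching entry of newval. oldval does not need to be sorted.
    (Hypothetical helper for illustration.)"""
    oldval = np.asarray(oldval)
    newval = np.asarray(newval)
    order = np.argsort(oldval)             # permutation that sorts oldval
    out = arr.ravel().copy()               # flat working copy, input stays untouched
    mask = np.isin(out, oldval)            # elements that have a replacement
    # Positions within the sorted view of oldval
    pos = np.searchsorted(oldval, out[mask], sorter=order)
    # order[pos] maps sorted positions back to original row positions
    out[mask] = newval[order[pos]]
    return out.reshape(arr.shape)

arr = np.array([[23,  4, 23,  5,  8],
                [ 3,  6,  8,  5, 11],
                [16, 24, 15,  4, 10],
                [ 4, 16, 10,  8,  1]])

# Same code pairs as the sample df, deliberately shuffled
result = replace_codes(arr, [24, 4, 16, 8, 12], [5, 4, 6, 3, 5])
print(result)
```

Unlike the in-place version above, this sketch returns a new array, which avoids relying on arr.ravel() being a view of arr.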

Approach #2: As a variant, you can create the mask by comparing np.searchsorted(oldval,arr,'left') with np.searchsorted(oldval,arr,'right'), as discussed in the solutions for this question. For a sorted oldval, the two results differ exactly where an element of arr is present in oldval. The 'left' indices can then be reused when putting values into arr, which makes this solution more efficient, like so -

# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])

# Left and right indices for each match from oldval in arr
left_idx = np.searchsorted(oldval,arr,'left')
right_idx = np.searchsorted(oldval,arr,'right')

# Mask of elements to be changed
mask = left_idx!=right_idx

# Using the mask put selective elements from continent_codes column into arr
arr[mask] = newval[left_idx[mask]]
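To see why comparing the two index arrays yields the mask, here is a small worked example. The probe values are made up for illustration; oldval is the sorted country_codes column from the sample.

```python
import numpy as np

oldval = np.array([4, 8, 12, 16, 24])   # sorted country codes from the sample df
values = np.array([3, 4, 5, 24, 30])    # hypothetical probe values

left_idx = np.searchsorted(oldval, values, 'left')
right_idx = np.searchsorted(oldval, values, 'right')

# A value present in oldval is spanned by its left and right insertion points
mask = left_idx != right_idx

print(left_idx)   # [0 0 1 4 5]
print(right_idx)  # [0 1 1 5 5]
print(mask)       # [False  True False  True False]
```

Note that left_idx is 5 for the value 30, which equals len(oldval) and would be out of bounds as an index into newval. That is why newval is indexed with left_idx[mask] rather than with left_idx directly.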

Runtime tests and output verification

Function definitions -

def original_app(arr,df):
    for index, row in df.iterrows():
        arr[arr == row['country_codes']] = row['continent_codes']

def vectorized_app1(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    mask = np.in1d(arr,oldval)
    idx = np.searchsorted(oldval,arr.ravel()[mask])
    arr.ravel()[mask] = newval[idx]

def vectorized_app2(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    left_idx = np.searchsorted(oldval,arr,'left')
    right_idx = np.searchsorted(oldval,arr,'right')
    mask = left_idx!=right_idx
    arr[mask] = newval[left_idx[mask]]

Verify outputs -

In [195]: # Input array
     ...: arr = np.random.randint(0,100000,(1000,1000))
     ...: 
     ...: # Setup input dataframe
     ...: N = 1000
     ...: oldvals = np.unique(np.random.randint(0,100000,N))
     ...: newvals = np.random.randint(0,9,(oldvals.size))
     ...: df=pd.DataFrame({'country_codes':oldvals,'continent_codes':newvals})
     ...: df = df.reindex(columns=sorted(df.columns)[::-1])  # reindex_axis is gone from newer pandas
     ...: 
     ...: # Make copies for input array for funcs to update them
     ...: arrc1 = arr.copy()
     ...: arrc2 = arr.copy()
     ...: arrc3 = arr.copy()
     ...: 

In [196]: # Verify outputs
     ...: original_app(arrc1,df)
     ...: vectorized_app1(arrc2,df)
     ...: vectorized_app2(arrc3,df)
     ...: 

In [197]: np.allclose(arrc1,arrc2)
Out[197]: True

In [198]: np.allclose(arrc1,arrc3)
Out[198]: True

Timings -

In [199]: # Make copies for input array for funcs to update them
     ...: arrc1 = arr.copy()
     ...: arrc2 = arr.copy()
     ...: arrc3 = arr.copy()
     ...: 

In [200]: %timeit original_app(arrc1,df)
1 loops, best of 3: 2.79 s per loop

In [201]: %timeit vectorized_app1(arrc2,df)
1 loops, best of 3: 360 ms per loop

In [202]: %timeit vectorized_app2(arrc3,df)
1 loops, best of 3: 213 ms per loop
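Because these functions modify their input in place, repeated %timeit loops after the first one run on an array that has already been updated. Outside IPython, a minimal sketch that times each run on a fresh copy could look like the following. The signature taking oldval and newval directly (instead of df) and the seeded random generator are assumptions made to keep the sketch self-contained, and the measured time includes the cost of arr.copy().

```python
import timeit
import numpy as np

def vectorized_app2(arr, oldval, newval):
    # Same logic as Approach #2; oldval must be sorted
    left_idx = np.searchsorted(oldval, arr, 'left')
    right_idx = np.searchsorted(oldval, arr, 'right')
    mask = left_idx != right_idx
    arr[mask] = newval[left_idx[mask]]

rng = np.random.default_rng(0)
arr = rng.integers(0, 100000, (1000, 1000))
oldval = np.unique(rng.integers(0, 100000, 1000))   # np.unique returns sorted values
newval = rng.integers(0, 9, oldval.size)

# Each call receives its own copy, so every run sees the original data
times = timeit.repeat(lambda: vectorized_app2(arr.copy(), oldval, newval),
                      number=1, repeat=3)
print(f"best of 3: {min(times) * 1e3:.1f} ms")
```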

2 Comments

@user308827 Glad to help, it was an interesting problem!
I have a similar array structure, and replicating your code gives me the following error: index 193 is out of bounds for axis 0 with size 193, where 193 is the length of my dataframe. How can I solve this?
1

With this data as an example, with at most N countries:

N=10**5
values=np.random.randint(0,N,(1000,1000))
exemple={'country':np.arange(N//2),'continent':np.random.randint(1,5,N//2)}
df=pd.DataFrame.from_dict(exemple)

You can just do:

v=np.arange(N)
v[df['country']]=df['continent']
v.take(values,out=values)

This is probably not optimal, but it is efficient (20 ms).
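The same lookup-table idea can be applied to the sample data from the first answer, as a minimal sketch. It assumes all values are non-negative integers (a float array like the one in the question would first need a cast such as arr.astype(np.intp)), and that a table as long as the largest value fits in memory. The names size and lut are chosen here for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[23,  4, 23,  5,  8],
                [ 3,  6,  8,  5, 11],
                [16, 24, 15,  4, 10],
                [ 4, 16, 10,  8,  1]])
df = pd.DataFrame({'country_codes': [4, 8, 12, 16, 24],
                   'continent_codes': [4, 3, 5, 6, 5]})

# The table must cover every value that can occur in arr or in the codes
size = int(max(arr.max(), df['country_codes'].max())) + 1

# Identity table: values without a replacement map to themselves
lut = np.arange(size)
lut[df['country_codes'].to_numpy()] = df['continent_codes'].to_numpy()

# Integer-array indexing applies the table to every element at once
result = lut[arr]
print(result)
```

Here lut[arr] builds a new array, whereas v.take(values, out=values) in the answer writes the result back into the input.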
