Replace values in numpy 2D array based on pandas dataframe

Question

>>> arr
array([[ 0., 10.,  0., ...,  0.,  0.,  0.],
       [ 0.,  4.,  0., ...,  6.,  0.,  9.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  2.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  3.,  0.]])

In the numpy array above, I would like to replace every value that matches the column country_codes in the dataframe (df_A) with the value from the column continent_codes in df_A. df_A looks like:

   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5

Right now, I loop through the dataframe and replace values using numpy boolean indexing. Given that iterrows() tends to be slow, is there a more direct/vectorized way to do this?

for index, row in self.df_A.iterrows():
    arr[arr == row['country_codes']] = row['continent_codes']

Hmm, one method would be to construct a df from your array and then call map on each column: a = pd.DataFrame(arr); a.apply(lambda x: x.map(df_A.set_index('country_codes')['continent_codes'])), or something like this.
– EdChum, Commented Dec 16, 2015 at 20:26
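
For illustration, the idea in the comment above might look roughly like the sketch below. Note that map yields NaN for values that have no entry in the mapping, so this version falls back to the original value with fillna; that fallback, the sample arrays, and the names mapping, mapped and result are assumptions added here, not part of the comment.

```python
import numpy as np
import pandas as pd

# Small integer sample (assumed data, chosen to mirror the question)
arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11]])
df_A = pd.DataFrame({'country_codes': [4, 8, 12, 16, 24],
                     'continent_codes': [4, 3, 5, 6, 5]})

# Series indexed by country code, holding the continent code
mapping = df_A.set_index('country_codes')['continent_codes']

a = pd.DataFrame(arr)
# map() gives NaN where a value is not a country code,
# so fill those positions with the original column values
mapped = a.apply(lambda col: col.map(mapping).fillna(col))

# Back to a numpy array with the original dtype
result = mapped.to_numpy().astype(arr.dtype)
print(result)
```

This is convenient but goes through pandas column by column, so it is unlikely to be the fastest option for large arrays.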

Community · Accepted Answer · 2017-05-23 12:04:04Z

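Approach #1: A vectorized approach using np.searchsorted and np.in1d is listed below. Note that np.searchsorted assumes country_codes is sorted in ascending order, as it is in the sample data. (In recent NumPy versions, np.isin is the recommended replacement for np.in1d.)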

# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])

# Mask of elements to be changed
mask = np.in1d(arr,oldval)

# Indices for each match from oldval in arr
idx = np.searchsorted(oldval,arr.ravel()[mask])

# Using the mask put selective elements from continent_codes column into arr
arr.ravel()[mask] = newval[idx]
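
Because np.searchsorted expects a sorted first argument, the snippet above can return wrong or out-of-range indices when country_codes is not in ascending order. A minimal sketch of one way to handle that case is shown below; the function name replace_unsorted, the use of np.isin, and returning a copy instead of updating in place are choices made for this illustration only.

```python
import numpy as np
import pandas as pd

def replace_unsorted(arr, df):
    """Return a copy of arr with country codes replaced by continent codes.

    Hypothetical helper: works even if df['country_codes'] is not sorted.
    """
    oldval = df['country_codes'].to_numpy()
    newval = df['continent_codes'].to_numpy()

    # Sort both columns together so searchsorted sees ascending keys
    order = np.argsort(oldval)
    sorted_old = oldval[order]
    sorted_new = newval[order]

    # Boolean mask with the same shape as arr
    mask = np.isin(arr, sorted_old)

    # Positions of the matching elements within the sorted keys
    idx = np.searchsorted(sorted_old, arr[mask])

    out = arr.copy()
    out[mask] = sorted_new[idx]
    return out

arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11]])
# Same mapping as the sample data, but deliberately shuffled
df = pd.DataFrame({'country_codes': [24, 8, 4, 16, 12],
                   'continent_codes': [5, 3, 4, 6, 5]})
result = replace_unsorted(arr, df)
print(result)
```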

Sample run -

>>> arr   # Original 2D array
array([[23,  4, 23,  5,  8],
       [ 3,  6,  8,  5, 11],
       [16, 24, 15,  4, 10],
       [ 4, 16, 10,  8,  1]])
>>> df
   country_codes  continent_codes
0              4                4
1              8                3
2             12                5
3             16                6
4             24                5

>>> oldval = np.array(df['country_codes'])
>>> newval = np.array(df['continent_codes'])
>>> mask = np.in1d(arr,oldval)
>>> idx = np.searchsorted(oldval,arr.ravel()[mask])
>>> arr.ravel()[mask] = newval[idx]

>>> mask.reshape(arr.shape)  # Mask array depicting which elements were updated
array([[False,  True, False, False,  True],
       [False, False,  True, False, False],
       [ True,  True, False,  True, False],
       [ True,  True, False,  True, False]], dtype=bool)
>>> arr  # Updated 2D array
array([[23,  4, 23,  5,  3],
       [ 3,  6,  3,  5, 11],
       [ 6,  5, 15,  4, 10],
       [ 4,  6, 10,  3,  1]])

Approach #2: As a variant, you can create the mask by comparing np.searchsorted(oldval,arr,'left') with np.searchsorted(oldval,arr,'right'), as discussed in the solutions for this question. The 'left' indices can then be reused when putting values into arr, which makes for a more efficient solution, like so -

# Store country_codes and continent_codes column data for further usage
oldval = np.array(df['country_codes'])
newval = np.array(df['continent_codes'])

# Left and right indices for each match from oldval in arr
left_idx = np.searchsorted(oldval,arr,'left')
right_idx = np.searchsorted(oldval,arr,'right')

# Mask of elements to be changed
mask = left_idx!=right_idx

# Using the mask put selective elements from continent_codes column into arr
arr[mask] = newval[left_idx[mask]]
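
To see why comparing the two searchsorted results works as a membership test, here is a small sketch on a 1D array (the array named query is made up for this illustration). For a value present in oldval, the 'right' insertion point lies one past the 'left' one; for an absent value the two coincide.

```python
import numpy as np

oldval = np.array([4, 8, 12, 16, 24])   # sorted country codes
newval = np.array([4, 3, 5, 6, 5])      # matching continent codes
query = np.array([3, 4, 5, 24, 30])     # assumed sample values

left_idx = np.searchsorted(oldval, query, 'left')
right_idx = np.searchsorted(oldval, query, 'right')
print(left_idx)    # [0 0 1 4 5]
print(right_idx)   # [0 1 1 5 5]

# The indices differ only where the value exists in oldval
mask = left_idx != right_idx
print(mask)        # [False  True False  True False]

# left_idx can equal len(oldval) (here for 30), but such entries
# are always masked out, so the lookup below stays in bounds
query[mask] = newval[left_idx[mask]]
print(query)       # [ 3  4  5  5 30]
```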

Runtime tests and output verification

Function definitions -

def original_app(arr,df):
    for index, row in df.iterrows():
        arr[arr == row['country_codes']] = row['continent_codes']

def vectorized_app1(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    mask = np.in1d(arr,oldval)
    idx = np.searchsorted(oldval,arr.ravel()[mask])
    arr.ravel()[mask] = newval[idx]

def vectorized_app2(arr,df):
    oldval = np.array(df['country_codes'])
    newval = np.array(df['continent_codes'])
    left_idx = np.searchsorted(oldval,arr,'left')
    right_idx = np.searchsorted(oldval,arr,'right')
    mask = left_idx!=right_idx
    arr[mask] = newval[left_idx[mask]]

Verify outputs -

In [195]: # Input array
     ...: arr = np.random.randint(0,100000,(1000,1000))
     ...: 
     ...: # Setup input dataframe
     ...: N = 1000
     ...: oldvals = np.unique(np.random.randint(0,100000,N))
     ...: newvals = np.random.randint(0,9,(oldvals.size))
     ...: df=pd.DataFrame({'country_codes':oldvals,'continent_codes':newvals})
     ...: df = df.reindex(columns=sorted(df.columns)[::-1])
     ...: 
     ...: # Make copies for input array for funcs to update them
     ...: arrc1 = arr.copy()
     ...: arrc2 = arr.copy()
     ...: arrc3 = arr.copy()
     ...: 

In [196]: # Verify outputs
     ...: original_app(arrc1,df)
     ...: vectorized_app1(arrc2,df)
     ...: vectorized_app2(arrc3,df)
     ...: 

In [197]: np.allclose(arrc1,arrc2)
Out[197]: True

In [198]: np.allclose(arrc1,arrc3)
Out[198]: True

Timings -

In [199]: # Make copies for input array for funcs to update them
     ...: arrc1 = arr.copy()
     ...: arrc2 = arr.copy()
     ...: arrc3 = arr.copy()
     ...: 

In [200]: %timeit original_app(arrc1,df)
1 loops, best of 3: 2.79 s per loop

In [201]: %timeit vectorized_app1(arrc2,df)
1 loops, best of 3: 360 ms per loop

In [202]: %timeit vectorized_app2(arrc3,df)
1 loops, best of 3: 213 ms per loop

I have a similar array structure, and replicating your code gives me the following error: index 193 is out of bounds for axis 0 with size 193, where 193 is the length of my dataframe. How can I solve this?

B. M. · Accepted Answer · 2015-12-17 17:05:08Z

With this data as an example, with at most N countries:

import numpy as np
import pandas as pd

N=10**5
values=np.random.randint(0,N,(1000,1000))
exemple={'country':np.arange(N//2),'continent':np.random.randint(1,5,N//2)}
df=pd.DataFrame.from_dict(exemple)

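You can just do:

v=np.arange(N)                    # identity lookup table: v[i] == i
v[df['country']]=df['continent']  # overwrite the entries for known country codes
v.take(values,out=values)         # replace each element of values with v[element], in place

Probably not optimal, but efficient (20 ms).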
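
As a rough sketch of how the same lookup-table idea could be applied to the column names from the question, see below. It assumes all codes are non-negative integers small enough that a table of that size fits in memory; the sample data and the names size, lookup and result are made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'country_codes': [4, 8, 12, 16, 24],
                   'continent_codes': [4, 3, 5, 6, 5]})
arr = np.array([[23, 4, 23, 5, 8],
                [3, 6, 8, 5, 11]])

# The table must cover every value that can appear as an index
size = max(arr.max(), df['country_codes'].max()) + 1

# Identity table: values without a mapping are returned unchanged
lookup = np.arange(size)
lookup[df['country_codes'].to_numpy()] = df['continent_codes'].to_numpy()

# Fancy indexing applies the table to every element at once
result = lookup[arr]
print(result)
```

Unlike the searchsorted approaches, this does not need country_codes to be sorted, but it does not apply to float arrays or negative codes without first converting them.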

edited Dec 17, 2015 at 17:05

answered Dec 17, 2015 at 7:26
