
I have quite a large 1d numpy array Xold with given values. These values shall be replaced according to a rule specified by a 2d numpy array Y. An example would be

Xold=np.array([0,1,2,3,4])
Y=np.array([[0,0],[1,100],[3,300],[4,400],[2,200]])

Whenever a value in Xold is identical to a value in Y[:,0], the new value in Xnew should be the corresponding value in Y[:,1]. This is accomplished by two nested for loops:

Xnew = np.zeros(len(Xold))
for i in range(len(Xold)):
    for j in range(len(Y)):
        if Xold[i] == Y[j, 0]:
            Xnew[i] = Y[j, 1]

With the given example, this yields Xnew=[0,100,200,300,400]. However, for large data sets this procedure is quite slow. What is a faster and more elegant way to accomplish this task?


8 Answers


SELECTING THE FASTEST METHOD

Answers to this question provided a nice assortment of ways to replace elements in a numpy array. Let's check which one is the quickest.

TL;DR: Numpy indexing is the winner

def meth1():  # suggested by @Slam
    for old, new in Y:
        Xold[Xold == old] = new

def meth2():  # suggested by myself; convert y_dict = dict(Y) first
    [y_dict[i] if i in y_dict.keys() else i for i in Xold]

def meth3():  # suggested by @Eelco Hoogendoorn; import numpy_indexed as npi first
    npi.remap(Xold, keys=Y[:, 0], values=Y[:, 1])

def meth4():  # suggested by @Brad Solomon; import pandas as pd first
    pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0])).values

# suggested by @jdehesa; create Xnew = Xold.copy() and the index
# idx = np.searchsorted(Xold, Y[:, 0]) first
def meth5():
    Xnew[idx] = Y[:, 1]
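
A minimal setup sketch (my reconstruction, pieced together from the comments in the code above and the dataset given in the comments below) that makes the five functions runnable:

import numpy as np
import numpy_indexed as npi
import pandas as pd

# test data (from the comments below)
Xold = np.array([0, 1, 2, 3, 4, 4, 4, 0])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

# helpers referenced by the methods
y_dict = dict(Y)                       # for meth2
idx = np.searchsorted(Xold, Y[:, 0])   # for meth5 (assumes Xold is sorted)
Xnew = Xold.copy()                     # for meth5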

The not-so-surprising results:

 In [39]: timeit.timeit(meth1, number=1000000)                                                                      
 Out[39]: 12.08

 In [40]: timeit.timeit(meth2, number=1000000)                                                                      
 Out[40]: 2.87

 In [38]: timeit.timeit(meth3, number=1000000)                                                                      
 Out[38]: 55.39

 In [12]: timeit.timeit(meth4, number=1000000)                                                                                      
 Out[12]: 256.84

 In [50]: timeit.timeit(meth5, number=1000000)                                                                                      
 Out[50]: 1.12

So, the good old list comprehension is the second fastest, and the winning approach is numpy indexing combined with searchsorted().


2 Comments

What dataset are you testing it on?
Xold=np.array([0,1,2,3,4,4,4,0]) , Y=np.array([[0,0],[1,100],[3,300],[4,400],[2,200]])

We can use np.searchsorted for the generic case when the data in the first column of Y is not necessarily sorted -

sidx = Y[:,0].argsort()
out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1]

Sample run -

In [53]: Xold
Out[53]: array([14, 10, 12, 13, 11])

In [54]: Y
Out[54]: 
array([[ 10,   0],
       [ 11, 100],
       [ 13, 300],
       [ 14, 400],
       [ 12, 200]])

In [55]: sidx = Y[:,0].argsort()
    ...: out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1]

In [56]: out
Out[56]: array([400,   0, 200, 300, 100])

If not all elements have corresponding mappings available, then we need to do a bit more work, like so -

sidx = Y[:,0].argsort()
sorted_indx = np.searchsorted(Y[:,0], Xold, sorter=sidx)
sorted_indx[sorted_indx==len(sidx)] = len(sidx)-1
idx_out = sidx[sorted_indx]
out = Y[idx_out,1]
out[Y[idx_out,0]!=Xold] = 0 # NA values as 0s
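
For instance, with a hypothetical input where one value has no mapping (a sketch, using the Y from the question):

Xold = np.array([0, 1, 2, 5])  # 5 has no mapping in Y
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

sidx = Y[:,0].argsort()
sorted_indx = np.searchsorted(Y[:,0], Xold, sorter=sidx)
sorted_indx[sorted_indx==len(sidx)] = len(sidx)-1
idx_out = sidx[sorted_indx]
out = Y[idx_out,1]
out[Y[idx_out,0]!=Xold] = 0
# out is now array([  0, 100, 200,   0]) - the unmapped 5 became 0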

5 Comments

This is good, although, assuming there exists the case where not all values have a mapping, I'm not sure if those should be set to 0/NA/... or left as they were in Xold. But I think that would just mean replacing the last 0 with Xold[Y[idx_out,0]!=Xold] so good solution in any case.
@jdehesa OP has Xnew=np.zeros(len(Xold)) for the output. So, that made sense to me to use the same.
this code didn't work for me: In [16]: sidx = Y[:,0].argsort() In [17]: out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1] IndexError: index 5 is out of bounds for axis 1 with size 5
As was pointed out by @MihaiAlexandruIonut, this is because Xold in my example contains elements missing from Y. However, there was no initial limitation that this can't be the case.
@DanielKislyuk Hence, the generic solution at the end of my post.

Here is one possibility:

import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])
# Check every X value against every Y first value
m = Xold == Y[:, 0, np.newaxis]
# Check which elements in X are among Y first values
# (so values that are not in Y are not replaced)
m_X = np.any(m, axis=0)
# Compute replacement
# Xold * (1 - m_X) are the non-replaced values
# np.sum(Y[:, 1, np.newaxis] * m, axis=0) * m_X are the replaced values
Xnew = Xold * (1 - m_X) + np.sum(Y[:, 1, np.newaxis] * m, axis=0) * m_X
print(Xnew)

Output:

[  0 100 200 300 400]

This method works for more or less every case (unsorted arrays, multiple repetitions of values in X, values in X not replaced, values in Y not replacing anything in X), except if you give two replacements for the same value in Y, which would be wrong anyway. However, its time and space complexity is the product of the sizes of X and Y. If your problem has additional constraints (data is sorted, no repetitions, etc.) it might be possible to do something better. For example, if X is sorted with no repeated elements and every value in Y replaces a value in X (like in your example), this would probably be faster:

import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])
idx = np.searchsorted(Xold, Y[:, 0])
Xnew = Xold.copy()
Xnew[idx] = Y[:, 1]
print(Xnew)
# [  0 100 200 300 400]



The first improvement you can make is to use numpy indexing, but you'll still have one loop:

for old, new in Y: 
    Xold[Xold == old] = new

1 Comment

This is incorrect. It returns [200 200 200 300 400 400] for my case X = np.array([0,1,2,3,4,4]); Y = np.array([[0,1],[1,2],[3,300],[4,400],[2,200]])
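
One way to avoid the chaining problem this comment describes (a sketch, not part of the original answer) is to write into a copy, so every mask is computed against the untouched input:

Xnew = Xold.copy()
for old, new in Y:
    Xnew[Xold == old] = new  # mask built on the original Xold, so replacements cannot cascade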

You can use fancy indexing in combination with the argsort method.

Xnew = Y[Y[:,1].argsort()][:, 1][Xold] 

Output

array([  0, 100, 200, 300, 400])

3 Comments

this code didn't work for me. Xnew = Y[Y[:,1].argsort()][:, 1][Xold] IndexError: index 100 is out of bounds for axis 1 with size 5
@DanielKislyuk, this is because your Xold array contains indexes which are not present in the Y array.
yep. could you point to the docs that specify how Y[Y[:,1].argsort()][:, 1][Xold] substitution works? can't get a grasp of it.
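
In response to the last comment, a step-by-step decomposition of the expression (a sketch; the intermediate names are mine):

order = Y[:, 1].argsort()  # indices that sort Y by its second column
Y_sorted = Y[order]        # rows of Y reordered accordingly
vals = Y_sorted[:, 1]      # the replacement values, now in sorted order
Xnew = vals[Xold]          # fancy indexing: each entry of Xold is a position into vals

Note that this treats the entries of Xold as positions into vals, so it only works when the keys in Y[:,0] are exactly 0..len(Y)-1; with the example data, sorting by the value column happens to produce the same order as sorting by the key column Y[:,0], which is why it works here.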

Solution with pd.Series.map()

If you're open to using the Pandas library, you can also do this in a vectorized way with .map():

>>> import pandas as pd
>>> pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0]))                                                                                                                                                                    
0      0
1    100
2    200
3    300
4    400
dtype: int64

>>> pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0])).values                                                                                                                                                            
array([  0, 100, 200, 300, 400])

For the signature a.map(b), a looks for its corresponding entries in the index of b, and maps to the respective values in b.

b here is pd.Series(Y[:, 1], index=Y[:, 0]), which uses the 0th column as the index and 1st column as the values that get mapped to.
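
One caveat (my addition, not from the original answer): values of Xold that are missing from the index of b map to NaN, which promotes the result to float. A sketch for leaving such values unchanged instead:

s = pd.Series(Xold)
mapping = pd.Series(Y[:, 1], index=Y[:, 0])
Xnew = s.map(mapping).fillna(s).astype(Xold.dtype).values  # keep unmapped values as-is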


Using pandas.core.algorithms directly

Under the hood, this will use .get_indexer() and the Cython-implemented take_1d():

indexer = mapper.index.get_indexer(values)
new_values = algorithms.take_1d(mapper._values, indexer)

Knowing that, if the arrays are really massive you could cut down some overhead like this:

from pandas.core import algorithms

indexer = pd.Index(Y[:, 0]).get_indexer(Xold)  
mapped = algorithms.take_1d(Y[:, 1], indexer)
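
Since .get_indexer() returns -1 for values that are absent from the index, unmapped entries can be detected and left unchanged at this level too (a sketch):

import numpy as np

Xnew = np.where(indexer == -1, Xold, Y[:, 1][indexer])  # -1 marks unmapped entries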



The numpy_indexed package (disclaimer: I am its author) contains an efficient vectorized function that solves the general problem:

import numpy_indexed as npi
Xnew = npi.remap(Xold, keys=Y[:, 0], values=Y[:, 1])

That is, this will work for any dtype, or when keys and values to be replaced are themselves ndarrays, and you get a kwarg to specify how to react to missing elements.

Not sure how it compares to pandas performance-wise; but one of the design choices in this library is that performing elementary operations like this (or doing group-by etc) should not involve the creation of an entire new datatype like a Series or Table, which always bothered me about using pandas for this type of thing.



You could convert Y to a dictionary with y = dict(Y) and then run the following list comprehension:

[y[i] if i in y.keys() else i for i in Xold]
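
A slightly more idiomatic variant of the same idea (a sketch), using dict.get with the original value as the default and wrapping the result back into an array:

y = dict(Y)
Xnew = np.array([y.get(i, i) for i in Xold])  # unmapped values fall through unchanged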

