
I have quite a large 1d numpy array Xold with given values. These values shall be replaced according to a rule specified by a 2d numpy array Y. An example would be

Xold=np.array([0,1,2,3,4])
Y=np.array([[0,0],[1,100],[3,300],[4,400],[2,200]])

Whenever a value in Xold is identical to a value in Y[:,0], the new value in Xnew should be the corresponding value in Y[:,1]. This is accomplished by two nested for loops:

Xnew = np.zeros(len(Xold))
for i in range(len(Xold)):
    for j in range(len(Y)):
        if Xold[i] == Y[j, 0]:
            Xnew[i] = Y[j, 1]

With the given example, this yields Xnew=[0,100,200,300,400]. However, for large data sets this procedure is quite slow. What is a faster and more elegant way to accomplish this task?


8 Answers


SELECTING THE FASTEST METHOD

Answers to this question provided a nice assortment of ways to replace elements in a numpy array. Let's check which one is the quickest.

TL;DR: Numpy indexing is the winner

def meth1():  # suggested by @Slam
    for old, new in Y:
        Xold[Xold == old] = new

def meth2():  # suggested by myself; convert y_dict = dict(Y) first
    [y_dict[i] if i in y_dict.keys() else i for i in Xold]

def meth3():  # suggested by @Eelco Hoogendoorn; import numpy_indexed as npi first
    npi.remap(Xold, keys=Y[:, 0], values=Y[:, 1])

def meth4():  # suggested by @Brad Solomon; import pandas as pd first
    pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0])).values

# suggested by @jdehesa; create Xnew = Xold.copy() and the index
# idx = np.searchsorted(Xold, Y[:, 0]) first
def meth5():
    Xnew[idx] = Y[:, 1]
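
A minimal setup sketch (my reconstruction, pieced together from the comments in the code above and the dataset given in the comments below) that makes the five functions runnable:

import numpy as np
import numpy_indexed as npi
import pandas as pd

# test data (from the comments below)
Xold = np.array([0, 1, 2, 3, 4, 4, 4, 0])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

# helpers referenced by the methods
y_dict = dict(Y)                       # for meth2
idx = np.searchsorted(Xold, Y[:, 0])   # for meth5 (assumes Xold is sorted)
Xnew = Xold.copy()                     # for meth5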

The not-so-surprising results:

 In [39]: timeit.timeit(meth1, number=1000000)                                                                      
 Out[39]: 12.08

 In [40]: timeit.timeit(meth2, number=1000000)                                                                      
 Out[40]: 2.87

 In [38]: timeit.timeit(meth3, number=1000000)                                                                      
 Out[38]: 55.39

 In [12]: timeit.timeit(meth4, number=1000000)                                                                                      
 Out[12]: 256.84

 In [50]: timeit.timeit(meth5, number=1000000)                                                                                      
 Out[50]: 1.12

So, the good old list comprehension is the second fastest, and the winning approach is numpy indexing combined with searchsorted().


2 Comments

What dataset are you testing it on?
Xold=np.array([0,1,2,3,4,4,4,0]) , Y=np.array([[0,0],[1,100],[3,300],[4,400],[2,200]])

We can use np.searchsorted for the generic case when the data in the first column of Y is not necessarily sorted -

sidx = Y[:,0].argsort()
out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1]

Sample run -

In [53]: Xold
Out[53]: array([14, 10, 12, 13, 11])

In [54]: Y
Out[54]: 
array([[ 10,   0],
       [ 11, 100],
       [ 13, 300],
       [ 14, 400],
       [ 12, 200]])

In [55]: sidx = Y[:,0].argsort()
    ...: out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1]

In [56]: out
Out[56]: array([400,   0, 200, 300, 100])

If not all elements have corresponding mappings available, then we need to do a bit more work, like so -

sidx = Y[:,0].argsort()
sorted_indx = np.searchsorted(Y[:,0], Xold, sorter=sidx)
sorted_indx[sorted_indx==len(sidx)] = len(sidx)-1
idx_out = sidx[sorted_indx]
out = Y[idx_out,1]
out[Y[idx_out,0]!=Xold] = 0 # NA values as 0s
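
For instance, with a hypothetical input where one value has no mapping (a sketch, using the Y from the question):

Xold = np.array([0, 1, 2, 5])  # 5 has no mapping in Y
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])

sidx = Y[:,0].argsort()
sorted_indx = np.searchsorted(Y[:,0], Xold, sorter=sidx)
sorted_indx[sorted_indx==len(sidx)] = len(sidx)-1
idx_out = sidx[sorted_indx]
out = Y[idx_out,1]
out[Y[idx_out,0]!=Xold] = 0
# out is now array([  0, 100, 200,   0]) - the unmapped 5 became 0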

5 Comments

This is good, although, assuming there exists the case where not all values have a mapping, I'm not sure if those should be set to 0/NA/... or left as they were in Xold. But I think that would just mean replacing the last 0 with Xold[Y[idx_out,0]!=Xold] so good solution in any case.
@jdehesa OP has Xnew=np.zeros(len(Xold)) for the output. So, that made sense to me to use the same.
this code didn't work for me: In [16]: sidx = Y[:,0].argsort() In [17]: out = Y[sidx[np.searchsorted(Y[:,0], Xold, sorter=sidx)],1] IndexError: index 5 is out of bounds for axis 1 with size 5
As was pointed out by @MihaiAlexandruIonut, this is because Xold in my example contains elements missing from Y. However, there was no initial limitation that this can't be the case.
@DanielKislyuk Hence, the generic solution at the end of my post.

Here is one possibility:

import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])
# Check every X value against every Y first value
m = Xold == Y[:, 0, np.newaxis]
# Check which elements in X are among Y first values
# (so values that are not in Y are not replaced)
m_X = np.any(m, axis=0)
# Compute replacement
# Xold * (1 - m_X) are the non-replaced values
# np.sum(Y[:, 1, np.newaxis] * m, axis=0) * m_X are the replaced values
Xnew = Xold * (1 - m_X) + np.sum(Y[:, 1, np.newaxis] * m, axis=0) * m_X
print(Xnew)

Output:

[  0 100 200 300 400]

This method works for more or less every case (unsorted arrays, multiple repetitions of values in X, values in X not replaced, values in Y not replacing anything in X), except if you give two replacements for the same value in Y, which would be wrong anyway. However, its time and space complexity is the product of the sizes of X and Y. If your problem has additional constraints (data is sorted, no repetitions, etc.) it might be possible to do something better. For example, if X is sorted with no repeated elements and every value in Y replaces a value in X (like in your example), this would probably be faster:

import numpy as np

Xold = np.array([0, 1, 2, 3, 4])
Y = np.array([[0, 0], [1, 100], [3, 300], [4, 400], [2, 200]])
idx = np.searchsorted(Xold, Y[:, 0])
Xnew = Xold.copy()
Xnew[idx] = Y[:, 1]
print(Xnew)
# [  0 100 200 300 400]



The first improvement you can make is to use numpy indexing, but you'll still have one loop:

for old, new in Y: 
    Xold[Xold == old] = new

1 Comment

This is incorrect. It returns [200 200 200 300 400 400] for my case X = np.array([0,1,2,3,4,4]); Y = np.array([[0,1],[1,2],[3,300],[4,400],[2,200]])
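
One way to avoid the chaining problem this comment describes (a sketch, not part of the original answer) is to write into a copy, so every mask is computed against the untouched input:

Xnew = Xold.copy()
for old, new in Y:
    Xnew[Xold == old] = new  # mask built on the original Xold, so replacements cannot cascade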

You can use fancy indexing in combination with the argsort method.

Xnew = Y[Y[:,1].argsort()][:, 1][Xold] 

Output

array([  0, 100, 200, 300, 400])

3 Comments

this code didn't work for me. Xnew = Y[Y[:,1].argsort()][:, 1][Xold] IndexError: index 100 is out of bounds for axis 1 with size 5
@DanielKislyuk, this is because your Xold array contains indexes which are not present in the Y array.
yep. could you point to the docs that specify how Y[Y[:,1].argsort()][:, 1][Xold] substitution works? can't get a grasp of it.
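
In response to the last comment, a step-by-step decomposition of the expression (a sketch; the intermediate names are mine):

order = Y[:, 1].argsort()  # indices that sort Y by its second column
Y_sorted = Y[order]        # rows of Y reordered accordingly
vals = Y_sorted[:, 1]      # the replacement values, now in sorted order
Xnew = vals[Xold]          # fancy indexing: each entry of Xold is a position into vals

Note that this treats the entries of Xold as positions into vals, so it only works when the keys in Y[:,0] are exactly 0..len(Y)-1; with the example data, sorting by the value column happens to produce the same order as sorting by the key column Y[:,0], which is why it works here.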

Solution with pd.Series.map()

If you're open to using the Pandas library, you can also do this in a vectorized way with .map():

>>> import pandas as pd
>>> pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0]))                                                                                                                                                                    
0      0
1    100
2    200
3    300
4    400
dtype: int64

>>> pd.Series(Xold).map(pd.Series(Y[:, 1], index=Y[:, 0])).values                                                                                                                                                            
array([  0, 100, 200, 300, 400])

For the signature a.map(b), a looks for its corresponding entries in the index of b, and maps to the respective values in b.

b here is pd.Series(Y[:, 1], index=Y[:, 0]), which uses the 0th column as the index and 1st column as the values that get mapped to.
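
One caveat (my addition, not from the original answer): values of Xold that are missing from the index of b map to NaN, which promotes the result to float. A sketch for leaving such values unchanged instead:

s = pd.Series(Xold)
mapping = pd.Series(Y[:, 1], index=Y[:, 0])
Xnew = s.map(mapping).fillna(s).astype(Xold.dtype).values  # keep unmapped values as-is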


Using pandas.core.algorithms directly

Under the hood, this will use .get_indexer() and the Cython-implemented take_1d():

indexer = mapper.index.get_indexer(values)
new_values = algorithms.take_1d(mapper._values, indexer)

Knowing that, if the arrays are really massive you could cut down some overhead like this:

from pandas.core import algorithms

indexer = pd.Index(Y[:, 0]).get_indexer(Xold)  
mapped = algorithms.take_1d(Y[:, 1], indexer)
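
Since .get_indexer() returns -1 for values that are absent from the index, unmapped entries can be detected and left unchanged at this level too (a sketch):

import numpy as np

Xnew = np.where(indexer == -1, Xold, Y[:, 1][indexer])  # -1 marks unmapped entries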



The numpy_indexed package (disclaimer: I am its author) contains an efficient vectorized function that solves the general problem:

import numpy_indexed as npi
Xnew = npi.remap(Xold, keys=Y[:, 0], values=Y[:, 1])

That is, this will work for any dtype, or when keys and values to be replaced are themselves ndarrays, and you get a kwarg to specify how to react to missing elements.

Not sure how it compares to pandas performance-wise; but one of the design choices in this library is that performing elementary operations like this (or doing group-by etc) should not involve the creation of an entire new datatype like a Series or Table, which always bothered me about using pandas for this type of thing.



You could convert Y to a dictionary with y = dict(Y) and then run the following list comprehension:

[y[i] if i in y.keys() else i for i in Xold]
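
A slightly more idiomatic variant of the same idea (a sketch), using dict.get with the original value as the default and wrapping the result back into an array:

y = dict(Y)
Xnew = np.array([y.get(i, i) for i in Xold])  # unmapped values fall through unchanged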

