
I have three arrays, such that:

Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
ID_Arr = np.array([1, 2, 3, 4, 5])
Value_Arr = np.array([0.1, 0.6, 0.3, 0.8, 0.2])

I want to create a new array which has the dimensions of Data_Arr, but where each element is taken from Value_Arr, using the index position in ID_Arr. So far I have this in a loop, but it's very slow as my Data array is very large:

out = np.zeros_like(Data_Arr, dtype=float)

for i in range(len(Data_Arr)):
    out[i] = Value_Arr[ID_Arr == Data_Arr[i]]

Is there a more Pythonic way of doing this that avoids the loop (it doesn't have to use NumPy)?

Actual data looks like:

Data_Arr = [ 852116  852116  852116 ... 1001816 1001816 1001816]
ID_Arr = [ 852116  852117  852118 ... 1001814 1001815 1001816]
Value_Arr = [1.5547194 1.5547196 1.5547197 ... 1.5536859 1.5536858 1.5536857]

shapes are:

Data_Arr = (4021165,)
ID_Arr = (149701,)
Value_Arr = (149701,)
  • I'm not going to offer this as an answer because it uses more memory and might not be any faster, but I note that d = dict(zip(ID_Arr, Value_Arr)); print([d[i] for i in Data_Arr]) would be equivalent (although not utilising numpy). Commented Jul 22, 2020 at 21:47
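For reference, a runnable version of that dict-based suggestion, using the toy arrays from the question:

```python
import numpy as np

# Toy arrays from the question
Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
ID_Arr = np.array([1, 2, 3, 4, 5])
Value_Arr = np.array([0.1, 0.6, 0.3, 0.8, 0.2])

# Build an ID -> value mapping once, then look each element up
d = dict(zip(ID_Arr, Value_Arr))
out = np.array([d[i] for i in Data_Arr])
# out -> [0.1 0.1 0.1 0.6 0.6 0.3 0.3 0.3 0.3 0.3 0.3 0.8 0.8 0.2 0.2 0.2]
```

As the commenter notes, this builds a Python dict over all IDs and loops in pure Python, so it trades memory for simplicity and may not beat a vectorized approach on 4M elements.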

3 Answers


Since ID_Arr is sorted, we can directly use np.searchsorted and index Value_Arr with the result:

Value_Arr[np.searchsorted(ID_Arr, Data_Arr)]
array([0.1, 0.1, 0.1, 0.6, 0.6, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.8, 0.8,
       0.2, 0.2, 0.2])

If ID_Arr isn't sorted (note: if there can be out-of-bounds indices, remove them first; see Divakar's answer):

s_ind = ID_Arr.argsort()
ss = np.searchsorted(ID_Arr, Data_Arr, sorter=s_ind)
out = Value_Arr[s_ind[ss]]

Checking with the arrays suggested by alaniwi:

Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
ID_Arr = np.array([2, 1, 3, 4, 5])
Value_Arr = np.array([0.6, 0.1, 0.3, 0.8, 0.2])

out_op = np.zeros_like(Data_Arr, dtype=float)
for i in range(len(Data_Arr)):
    out_op[i] = Value_Arr[ID_Arr==Data_Arr[i]]

s_ind = ID_Arr.argsort()
ss = np.searchsorted(ID_Arr, Data_Arr, sorter=s_ind)
out_answer = Value_Arr[s_ind[ss]]

np.array_equal(out_op, out_answer)
#True

10 Comments

Can you make a way that does not depend on ID_Arr being sorted? For example, if you permute both ID_Arr and Value_Arr in the same way (for example, swap the first two elements in both cases), it should not change the result.
The output is still not agreeing with the code in the question when I try it with lists that have been permuted as described.
This uses the order that would sort ID_Arr for the searchsorted, hence it produces the same output for an unordered ID_Arr. If you change Value_Arr, the output is of course different, since we use the result of searchsorted to index that array @alaniwi
I was swapping the first two elements of each of the two arrays: ID_Arr = array([2, 1, 3, 4, 5]) and Value_Arr = array([0.6, 0.1, 0.3, 0.8, 0.2]). Code in question still gives array([0.1, 0.1, 0.1, 0.6, 0.6, ...]) but yours gives array([0.6, 0.6, 0.6, 0.1, 0.1, ...]).
@yatu You need to index back with argsort() indices I suppose.

Based on the approaches from this post, here are the adaptations.

Approach #1

# https://stackoverflow.com/a/62658135/ @Divakar  
a,b,invalid_specifier = ID_Arr, Data_Arr, 0

sidx = a.argsort()
idx = np.searchsorted(a,b,sorter=sidx)

# Remove out-of-bounds indices as they won't be matches
idx[idx==len(a)] = 0

# Get traced back indices corresponding to original version of a
idx0 = sidx[idx]

# Mask out invalid ones with invalid_specifier and return
out = np.where(a[idx0]==b, Values_Arr[idx0], invalid_specifier)

Approach #2

Lookup based -

# https://stackoverflow.com/a/62658135/ @Divakar    
def find_indices_lookup(a,b,invalid_specifier=-1):
    # Setup array where we will assign ranged numbers
    N = max(a.max(), b.max())+1
    lookup = np.full(N, invalid_specifier)

    # Index into lookup with b to trace back the positions. Non-matching
    # ones keep invalid_specifier, since they were never assigned a ranged number
    lookup[a] = np.arange(len(a))
    indices = lookup[b]
    return indices                     

idx = find_indices_lookup(ID_Arr, Data_Arr)
out = np.where(idx != -1, Value_Arr[idx], 0)

Faster/simpler variant

And a simplified and hopefully faster version would be a direct lookup of values -

a,b,invalid_specifier = ID_Arr, Data_Arr, 0

N = max(a.max(), b.max())+1
lookup = np.zeros(N, dtype=Value_Arr.dtype)
lookup[ID_Arr] = Value_Arr
out = lookup[Data_Arr]

If all values in Data_Arr are guaranteed to appear in ID_Arr, we can use np.empty in place of np.zeros for the array assignment and thus gain a further performance boost.
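A quick check of the direct-lookup variant on the question's toy arrays (not part of the original answer):

```python
import numpy as np

ID_Arr = np.array([1, 2, 3, 4, 5])
Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
Value_Arr = np.array([0.1, 0.6, 0.3, 0.8, 0.2])

# Dense lookup table: position i holds the value for ID i
N = max(ID_Arr.max(), Data_Arr.max()) + 1
lookup = np.zeros(N, dtype=Value_Arr.dtype)
lookup[ID_Arr] = Value_Arr
out = lookup[Data_Arr]
# out -> [0.1 0.1 0.1 0.6 0.6 0.3 0.3 0.3 0.3 0.3 0.3 0.8 0.8 0.2 0.2 0.2]
```

Note the table has one slot per possible ID value, so for the asker's IDs (up to ~1,001,816) this allocates an array of that length; that is still small compared to the 4M-element Data_Arr.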

Comments


Looks like you want:

out = Value_Arr[ID_Arr[Data_Arr - 1] - 1]

Note that the - 1 terms are due to Python/NumPy using 0-based indexing.

1 Comment

This does not work in the general case -- it is making assumptions that ID_Arr values are a sequence of integers starting at 1.
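To illustrate that point, a small hypothetical counterexample (arrays made up here) where the IDs do not start at 1, as in the asker's actual data:

```python
import numpy as np

# Hypothetical data: IDs start at 10, not 1
Data_Arr = np.array([10, 10, 20])
ID_Arr = np.array([10, 20])
Value_Arr = np.array([0.5, 0.7])

try:
    out = Value_Arr[ID_Arr[Data_Arr - 1] - 1]
except IndexError:
    # Data_Arr - 1 = [9, 9, 19] indexes past the end of the
    # 2-element ID_Arr, so the expression raises IndexError
    out = None
```

With the question's real IDs (around 852116+) the same `IndexError` occurs, since Data_Arr - 1 is far larger than len(ID_Arr).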
