
I have three arrays, such that:

Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
ID_Arr = np.array([1, 2, 3, 4, 5])
Value_Arr = np.array([0.1, 0.6, 0.3, 0.8, 0.2])

I want to create a new array which has the dimensions of Data_Arr, but where each element is taken from Value_Arr, using the index position in ID_Arr. So far I have this in a loop, but it's very slow as my Data array is very large:

out = np.zeros_like(Data_Arr, dtype=float)

for i in range(len(Data_Arr)):
    out[i] = Value_Arr[ID_Arr == Data_Arr[i]]

Is there a more Pythonic way of doing this that avoids the loop (it doesn't have to use NumPy)?

Actual data looks like:

Data_Arr = [ 852116  852116  852116 ... 1001816 1001816 1001816]
ID_Arr = [ 852116  852117  852118 ... 1001814 1001815 1001816]
Value_Arr = [1.5547194 1.5547196 1.5547197 ... 1.5536859 1.5536858 1.5536857]

shapes are:

Data_Arr = (4021165,)
ID_Arr = (149701,)
Value_Arr = (149701,)
  • I'm not going to offer this as an answer because it uses more memory and might not be any faster, but I note that d = dict(zip(ID_Arr, Value_Arr)); print([d[i] for i in Data_Arr]) would be equivalent (although not utilising numpy). Commented Jul 22, 2020 at 21:47
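For reference, a runnable version of that dict-based suggestion, using the toy arrays from the question:

```python
import numpy as np

# Toy arrays from the question
Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
ID_Arr = np.array([1, 2, 3, 4, 5])
Value_Arr = np.array([0.1, 0.6, 0.3, 0.8, 0.2])

# Build an ID -> value mapping once, then look each element up
d = dict(zip(ID_Arr, Value_Arr))
out = np.array([d[i] for i in Data_Arr])
# out -> [0.1 0.1 0.1 0.6 0.6 0.3 0.3 0.3 0.3 0.3 0.3 0.8 0.8 0.2 0.2 0.2]
```

As the commenter notes, this builds a Python dict over all IDs and loops in pure Python, so it trades memory for simplicity and may not beat a vectorized approach on 4M elements.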

3 Answers


Since ID_Arr is sorted, we can directly use np.searchsorted and index Value_Arr with the result:

Value_Arr[np.searchsorted(ID_Arr, Data_Arr)]
array([0.1, 0.1, 0.1, 0.6, 0.6, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.8, 0.8,
       0.2, 0.2, 0.2])

If ID_Arr isn't sorted (note: if there can be out-of-bounds indices, remove them first; see Divakar's answer):

s_ind = ID_Arr.argsort()
ss = np.searchsorted(ID_Arr, Data_Arr, sorter=s_ind)
out = Value_Arr[s_ind[ss]]

Checking with the arrays suggested by alaniwi:

Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
ID_Arr = np.array([2, 1, 3, 4, 5])
Value_Arr = np.array([0.6, 0.1, 0.3, 0.8, 0.2])

out_op = np.zeros_like(Data_Arr, dtype=float)
for i in range(len(Data_Arr)):
    out_op[i] = Value_Arr[ID_Arr==Data_Arr[i]]

s_ind = ID_Arr.argsort()
ss = np.searchsorted(ID_Arr, Data_Arr, sorter=s_ind)
out_answer = Value_Arr[s_ind[ss]]

np.array_equal(out_op, out_answer)
#True

10 Comments

Can you make a way that does not depend on ID_Arr being sorted? For example, if you permute both ID_Arr and Value_Arr in the same way (for example, swap the first two elements in both cases), it should not change the result.
The output is still not agreeing with the code in the question when I try it with lists that have been permuted as described.
This uses the order that would sort ID_Arr for the searchsorted, hence it produces the same output for an unordered ID_Arr. If you change Value_Arr, the output is of course different, since we use the result of searchsorted to index that array @alaniwi
I was swapping the first two elements of each of the two arrays: ID_Arr = array([2, 1, 3, 4, 5]) and Value_Arr = array([0.6, 0.1, 0.3, 0.8, 0.2]). Code in question still gives array([0.1, 0.1, 0.1, 0.6, 0.6, ...]) but yours gives array([0.6, 0.6, 0.6, 0.1, 0.1, ...]).
@yatu You need to index back with argsort() indices I suppose.

Based on the approaches from this post, here are the adaptations.

Approach #1

# https://stackoverflow.com/a/62658135/ @Divakar  
a,b,invalid_specifier = ID_Arr, Data_Arr, 0

sidx = a.argsort()
idx = np.searchsorted(a,b,sorter=sidx)

# Remove out-of-bounds indices as they won't be matches
idx[idx==len(a)] = 0

# Get traced back indices corresponding to original version of a
idx0 = sidx[idx]

# Mask out invalid ones with invalid_specifier and return
out = np.where(a[idx0]==b, Values_Arr[idx0], invalid_specifier)

Approach #2

Lookup based -

# https://stackoverflow.com/a/62658135/ @Divakar    
def find_indices_lookup(a,b,invalid_specifier=-1):
    # Setup array where we will assign ranged numbers
    N = max(a.max(), b.max())+1
    lookup = np.full(N, invalid_specifier)

    # Index into lookup with b to trace back the positions. Non-matching
    # ones keep invalid_specifier, since they were never assigned a ranged number
    lookup[a] = np.arange(len(a))
    indices = lookup[b]
    return indices                     

idx = find_indices_lookup(ID_Arr, Data_Arr)
out = np.where(idx != -1, Value_Arr[idx], 0)

Faster/simpler variant

And a simplified and hopefully faster version would be a direct lookup of values -

a,b,invalid_specifier = ID_Arr, Data_Arr, 0

N = max(a.max(), b.max())+1
lookup = np.zeros(N, dtype=Value_Arr.dtype)
lookup[ID_Arr] = Value_Arr
out = lookup[Data_Arr]

If all values in Data_Arr are guaranteed to appear in ID_Arr, we can use np.empty in place of np.zeros for the array assignment and thus gain a further performance boost.
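A quick check of the direct-lookup variant on the question's toy arrays (not part of the original answer):

```python
import numpy as np

ID_Arr = np.array([1, 2, 3, 4, 5])
Data_Arr = np.array([1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 5, 5, 5])
Value_Arr = np.array([0.1, 0.6, 0.3, 0.8, 0.2])

# Dense lookup table: position i holds the value for ID i
N = max(ID_Arr.max(), Data_Arr.max()) + 1
lookup = np.zeros(N, dtype=Value_Arr.dtype)
lookup[ID_Arr] = Value_Arr
out = lookup[Data_Arr]
# out -> [0.1 0.1 0.1 0.6 0.6 0.3 0.3 0.3 0.3 0.3 0.3 0.8 0.8 0.2 0.2 0.2]
```

Note the table has one slot per possible ID value, so for the asker's IDs (up to ~1,001,816) this allocates an array of that length; that is still small compared to the 4M-element Data_Arr.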

Comments


Looks like you want:

out = Value_Arr[ID_Arr[Data_Arr - 1] - 1]

Note that the - 1 terms are due to Python/NumPy using 0-based indexing.

1 Comment

This does not work in the general case -- it is making assumptions that ID_Arr values are a sequence of integers starting at 1.
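To illustrate that point, a small hypothetical counterexample (arrays made up here) where the IDs do not start at 1, as in the asker's actual data:

```python
import numpy as np

# Hypothetical data: IDs start at 10, not 1
Data_Arr = np.array([10, 10, 20])
ID_Arr = np.array([10, 20])
Value_Arr = np.array([0.5, 0.7])

try:
    out = Value_Arr[ID_Arr[Data_Arr - 1] - 1]
except IndexError:
    # Data_Arr - 1 = [9, 9, 19] indexes past the end of the
    # 2-element ID_Arr, so the expression raises IndexError
    out = None
```

With the question's real IDs (around 852116+) the same `IndexError` occurs, since Data_Arr - 1 is far larger than len(ID_Arr).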
