Find indices of large array if it contains values in smaller array

Question

Is there a fast numpy function for returning a list of indices in a larger array where it matches values from a smaller array? Smaller array is ~ 30M values and larger is 800M so I want to avoid a for-loop of numpy.where calls.

The problem with searchsorted is that it will return results even when their is not an exact match, it just gives the closest index, but I only want indices where there are exact matches

instead of this:

>>> a = array([1,2,3,4,5])
>>> b = array([2,4,7])
>>> searchsorted(a,b)
array([1, 3, 5])

I would want this:

>>> a = array([1,2,3,4,5])
>>> b = array([2,4,7])
>>> SOMEFUNCTION(a,b)
array([1, 3])

EDIT: the set of values in both the smaller and larger arrays are always unique and sorted.

Community · Accepted Answer · 2017-05-23 12:26:39Z

9

You could use np.in1d to find those elements of a which are in b. To find the index, use one call to np.where:

In [34]: a = array([1,2,3,4,5])

In [35]: b = array([2,4,7])

In [36]: np.in1d(a, b)
Out[38]: array([False,  True, False,  True, False], dtype=bool)

In [39]: np.where(np.in1d(a, b))
Out[39]: (array([1, 3]),)

Because a and b are already sorted, you could use

In [57]: np.searchsorted(b, a, side='right') != np.searchsorted(b, a, side='left')
Out[57]: array([False,  True, False,  True, False], dtype=bool)

instead of np.in1d(a, b). For large a and b, using searchsorted may be faster:

import numpy as np
a = np.random.choice(10**7, size=10**6, replace=False)
a.sort()
b = np.random.choice(10**7, size=10**5, replace=False)
b.sort()

In [53]: %timeit np.in1d(a, b)
10 loops, best of 3: 176 ms per loop

In [54]: %timeit np.searchsorted(b, a, side='right') != np.searchsorted(b, a, side='left')
10 loops, best of 3: 106 ms per loop

Jaime and Divakar have suggested some significant improvements on the method shown above. Here is some code which tests that the methods all return the same result, followed by some benchmarks:

import numpy as np

a = np.random.choice(10**7, size=10**6, replace=False)
a.sort()
b = np.random.choice(10**7, size=10**5, replace=False)
b.sort()

def using_searchsorted(a, b):
    return (np.where(np.searchsorted(b, a, side='right') 
                     != np.searchsorted(b, a, side='left')))[0]

def using_in1d(a, b):
    return np.where(np.in1d(a, b))[0]

def using_searchsorted_divakar(a, b):
    idx1 = np.searchsorted(a,b,'left')
    idx2 = np.searchsorted(a,b,'right')
    out = idx1[idx1 != idx2]
    return out

def using_jaime_mask(haystack, needle):
    idx = np.searchsorted(haystack, needle)
    mask = idx < haystack.size
    mask[mask] = haystack[idx[mask]] == needle[mask]
    idx = idx[mask]
    return idx

expected = using_searchsorted(a, b)
for func in (using_in1d, using_searchsorted_divakar, using_jaime_mask):
    result = func(a, b)
    assert np.allclose(expected, result)

In [29]: %timeit using_jaime_mask(a, b)
100 loops, best of 3: 13 ms per loop

In [28]: %timeit using_searchsorted_divakar(a, b)
10 loops, best of 3: 21.7 ms per loop

In [26]: %timeit using_searchsorted(a, b)
10 loops, best of 3: 109 ms per loop

In [27]: %timeit using_in1d(a, b)
10 loops, best of 3: 173 ms per loop

edited May 23, 2017 at 12:26

CommunityBot

11 silver badge

answered Aug 3, 2015 at 13:57

unutbu

886k197 gold badges1.9k silver badges1.7k bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Divakar Over a year ago

Smart stuff with the left-right use of searchsorted! Guess you need to index into np.searchsorted(b, a, side='left') with the mask though to get the actual indices.

unutbu Over a year ago

You would still need to use np.where to get the indices with respect to a.

Divakar Over a year ago

I meant something like : idx1 = np.searchsorted(a,b,'left'); idx2 = np.searchsorted(a,b,'right'); out = idx1[idx1 != idx2]. Would work maybe?

unutbu Over a year ago

@Divakar: That looks to be about 5x faster than my suggested solution (on the larger test case). Would you like to write it up?

Divakar Over a year ago

Appreciate the motivation there! Largely inspired by your solution I must say. Just posted the solution here.

|

Community · Accepted Answer · 2017-05-23 10:31:02Z

6

The default seacrh direction with np.searchsorted is left. We can also search for it from the right direction and the ones that are same in both these set of indices would be the ones to be avoided from the indices outputted from the left option to get the desired output. The motivation here is same as discussed in @unutbu's solution.

Thus, the implementation would look like this -

idx1 = np.searchsorted(a,b,'left')
idx2 = np.searchsorted(a,b,'right')
out = idx1[idx1 != idx2]

edited May 23, 2017 at 10:31

CommunityBot

11 silver badge

answered Aug 3, 2015 at 14:33

Divakar

222k19 gold badges273 silver badges374 bronze badges

2 Comments

unutbu Over a year ago

Using searchsorted(larger, smaller) instead of searchsorted(smaller, larger) and indexing idx1 instead of using np.where makes this a much smarter way to do it.

Jaime Over a year ago

You have to add extra handling, because searchsorted may return indices out of bounds, but a potentially quicker way of discarding non-matches, rather than re-searching with "right"... is to check if they match!

idx = np.searchsorted(haystack, needle); mask = idx < haystack.size; mask[mask] = haystack[idx[mask]] == needle[mask]; idx = idx[mask]

Collectives™ on Stack Overflow

Find indices of large array if it contains values in smaller array

2 Answers 2

6 Comments

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related