
I have an array traced_descIDs containing object IDs and I want to identify which items are not unique in this array. Then, for each unique duplicated ID, I need to identify which indices of traced_descIDs are associated with it.

As an example, if we take the traced_descIDs here, I want the following process to occur:

traced_descIDs = [1, 345, 23, 345, 90, 1]
dupIds = [1, 345]
dupInds = [[0,5],[1,3]]

I'm currently finding out which objects have more than 1 entry by:

mentions = np.array([len(np.argwhere( traced_descIDs == i)) for i in traced_descIDs])
dupMask = (mentions > 1)

however, this takes too long, as len(traced_descIDs) is around 150,000. Is there a faster way to achieve the same result?

Any help greatly appreciated. Cheers.

6 Answers

13

While dictionaries are O(n), the overhead of Python objects sometimes makes it more convenient to use numpy's functions, which use sorting and are O(n*log n). In your case, the starting point would be:

a = [1, 345, 23, 345, 90, 1]
unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)

If you are using a version of numpy earlier than 1.9, then that last line would have to be:

unq, unq_idx = np.unique(a, return_inverse=True)
unq_cnt = np.bincount(unq_idx)

The contents of the three arrays we have created are:

>>> unq
array([  1,  23,  90, 345])
>>> unq_idx
array([0, 3, 1, 3, 2, 0])
>>> unq_cnt
array([2, 1, 1, 2])

To get the repeated items:

cnt_mask = unq_cnt > 1
dup_ids = unq[cnt_mask]

>>> dup_ids
array([  1, 345])

Getting the indices is a little more involved, but pretty straightforward:

cnt_idx, = np.nonzero(cnt_mask)
idx_mask = np.in1d(unq_idx, cnt_idx)
idx_idx, = np.nonzero(idx_mask)
srt_idx = np.argsort(unq_idx[idx_mask])
dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])

>>> dup_idx
[array([0, 5]), array([1, 3])]
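
For convenience, the pieces above can be wrapped into a single helper. A minimal sketch of the same steps (the find_duplicates name is just for illustration):

import numpy as np

def find_duplicates(values):
    # Return (dup_ids, dup_idx): the duplicated values and, for each one,
    # the array of positions where it occurs in the input.
    a = np.asarray(values)
    unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
    cnt_mask = unq_cnt > 1
    dup_ids = unq[cnt_mask]
    cnt_idx, = np.nonzero(cnt_mask)
    idx_mask = np.in1d(unq_idx, cnt_idx)
    idx_idx, = np.nonzero(idx_mask)
    srt_idx = np.argsort(unq_idx[idx_mask])
    dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])
    return dup_ids, dup_idx

dup_ids, dup_idx = find_duplicates([1, 345, 23, 345, 90, 1])
# dup_ids -> array([  1, 345]); dup_idx -> [array([0, 5]), array([1, 3])]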

1 Comment

I'm much more comfortable with this answer and it doesn't appear to take much longer than the dictionary answer above. Thank you for your time.

5

There is scipy.stats.itemfreq which would give the frequency of each item:

>>> xs = np.array([1, 345, 23, 345, 90, 1])
>>> ifreq = sp.stats.itemfreq(xs)
>>> ifreq
array([[  1,   2],
       [ 23,   1],
       [ 90,   1],
       [345,   2]])
>>> [(xs == w).nonzero()[0] for w in ifreq[ifreq[:,1] > 1, 0]]
[array([0, 5]), array([1, 3])]
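
Note that scipy.stats.itemfreq has since been deprecated and removed in newer SciPy releases. If your SciPy no longer provides it, np.unique with return_counts gives the same frequency table; a rough equivalent of the snippet above (the ifreq and dup_inds names are chosen here for illustration):

import numpy as np

xs = np.array([1, 345, 23, 345, 90, 1])
vals, counts = np.unique(xs, return_counts=True)
ifreq = np.column_stack((vals, counts))        # same table as itemfreq produced above
dup_inds = [(xs == w).nonzero()[0] for w in vals[counts > 1]]
# dup_inds -> [array([0, 5]), array([1, 3])]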

1 Comment

I didn't know about this function. Thanks for bringing it to my attention.

3

Your current approach is O(N**2); use a dictionary to do it in O(N) time:

>>> from collections import defaultdict
>>> traced_descIDs = [1, 345, 23, 345, 90, 1]
>>> d = defaultdict(list)
>>> for i, x in enumerate(traced_descIDs):
...     d[x].append(i)
...     
>>> for k, v in d.items():
...     if len(v) == 1:
...         del d[k]
...         
>>> d
defaultdict(<type 'list'>, {1: [0, 5], 345: [1, 3]})

And to get the items and indices:

>>> from itertools import izip
>>> dupIds, dupInds = izip(*d.iteritems())
>>> dupIds, dupInds
((1, 345), ([0, 5], [1, 3]))

Note that if you want to preserve the order of items in dupIds, then use collections.OrderedDict and the dict.setdefault() method.
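
The snippet above is Python 2 (izip, iteritems, and deleting from the dict while iterating over it). A Python 3 sketch of the same idea, using the OrderedDict/setdefault variant mentioned above (on Python 3.7+ a plain dict would preserve insertion order as well):

from collections import OrderedDict

traced_descIDs = [1, 345, 23, 345, 90, 1]

d = OrderedDict()                              # keeps first-seen order of the IDs
for i, x in enumerate(traced_descIDs):
    d.setdefault(x, []).append(i)

dup = {k: v for k, v in d.items() if len(v) > 1}
dupIds, dupInds = list(dup), list(dup.values())
# dupIds -> [1, 345]; dupInds -> [[0, 5], [1, 3]]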

4 Comments

Personally I would prefer the numpy solution, but if you want to go this way, the standard library has you covered: from collections import Counter
Could you expand on exactly what isn't preserved by not using OrderedDict?
Note that this solution will create a lot of Python objects, hence memory usage will explode; that's why staying within numpy is probably preferable if you are dealing with large datasets.
@CarlM The output here could have been [345, 1], as dicts have no order. An OrderedDict will make sure the output is [1, 345].

2

td = np.array(traced_descIDs)
si = np.argsort(td)
td[si][np.append(False, np.diff(td[si]) == 0)]

That gives you:

array([  1, 345])

I haven't figured out the second part quite yet, but maybe this will be inspiration enough for you, or maybe I'll get back to it. :)
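
One possible way to finish the second part in the same spirit (a sketch, not the answerer's code): sort once, split the argsort order wherever the sorted value changes, and keep only the groups with more than one member.

import numpy as np

traced_descIDs = [1, 345, 23, 345, 90, 1]
td = np.array(traced_descIDs)

si = np.argsort(td, kind='stable')                   # stable sort keeps original order within ties
sorted_td = td[si]
boundaries = np.flatnonzero(np.diff(sorted_td)) + 1  # positions where a new value starts
groups = np.split(si, boundaries)                    # original indices, grouped by value

dupInds = [g for g in groups if len(g) > 1]
dupIds = [td[g[0]] for g in dupInds]
# dupIds -> [1, 345]; dupInds -> [array([0, 5]), array([1, 3])]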


0

A solution of the same vectorized efficiency as proposed by Jaime is embedded in the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
print(npi.group_by(traced_descIDs, np.arange(len(traced_descIDs))))

This gets us most of the way there; but if we also want to filter out singleton groups while avoiding any python loops and staying entirely vectorized, we can go a little lower level, and do:

g = npi.group_by(traced_descIDs)
unique = g.unique
idx = g.split_array_as_list(np.arange(len(traced_descIDs)))
duplicates = unique[g.count>1]
idx_duplicates = np.asarray(idx)[g.count>1]
print(duplicates, idx_duplicates)


0

np.unique for N dimensions

I had a similar problem with an ndarray in which I wanted to find which rows were duplicated.

x = np.arange(60).reshape(5,4,3)
x[1] = x[0]

Rows 0 and 1 should be duplicates along axis 0. I used np.unique with all of its return options, then used Jaime's method to locate the duplicates.

_,i,_,c = np.unique(x,1,1,1,axis=0)
x_dup = x[i[1<c]]

I request return_inverse unnecessarily here (it is discarded); only return_index and return_counts are needed. Here is the result:

>>> print(x_dup)
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]
  [ 9 10 11]]]
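
For reference, the same call spelled out with keyword arguments may be easier to read; a sketch (return_inverse is simply omitted, since it is not needed):

import numpy as np

x = np.arange(60).reshape(5, 4, 3)
x[1] = x[0]                                    # make slices 0 and 1 identical along axis 0

_, idx, cnt = np.unique(x, return_index=True, return_counts=True, axis=0)
x_dup = x[idx[cnt > 1]]                        # first occurrence of each duplicated axis-0 slice
print(x_dup)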

