
I have an array traced_descIDs containing object IDs and I want to identify which items are not unique in this array. Then, for each unique duplicated ID, I need to identify which indices of traced_descIDs are associated with it.

As an example, if we take the traced_descIDs here, I want the following process to occur:

traced_descIDs = [1, 345, 23, 345, 90, 1]
dupIds = [1, 345]
dupInds = [[0,5],[1,3]]

I'm currently finding out which objects have more than 1 entry by:

mentions = np.array([len(np.argwhere( traced_descIDs == i)) for i in traced_descIDs])
dupMask = (mentions > 1)

however, this takes too long, as len(traced_descIDs) is around 150,000. Is there a faster way to achieve the same result?

Any help greatly appreciated. Cheers.

6 Answers

13

While dictionaries are O(n), the overhead of Python objects sometimes makes it more convenient to use numpy's functions, which use sorting and are O(n*log n). In your case, the starting point would be:

a = [1, 345, 23, 345, 90, 1]
unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)

If you are using a version of numpy earlier than 1.9, then that last line would have to be:

unq, unq_idx = np.unique(a, return_inverse=True)
unq_cnt = np.bincount(unq_idx)

The contents of the three arrays we have created are:

>>> unq
array([  1,  23,  90, 345])
>>> unq_idx
array([0, 3, 1, 3, 2, 0])
>>> unq_cnt
array([2, 1, 1, 2])

To get the repeated items:

cnt_mask = unq_cnt > 1
dup_ids = unq[cnt_mask]

>>> dup_ids
array([  1, 345])

Getting the indices is a little more involved, but pretty straightforward:

cnt_idx, = np.nonzero(cnt_mask)
idx_mask = np.in1d(unq_idx, cnt_idx)
idx_idx, = np.nonzero(idx_mask)
srt_idx = np.argsort(unq_idx[idx_mask])
dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])

>>> dup_idx
[array([0, 5]), array([1, 3])]
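
For convenience, the pieces above can be wrapped into a single helper. A minimal sketch of the same steps (the find_duplicates name is just for illustration):

import numpy as np

def find_duplicates(values):
    # Return (dup_ids, dup_idx): the duplicated values and, for each one,
    # the array of positions where it occurs in the input.
    a = np.asarray(values)
    unq, unq_idx, unq_cnt = np.unique(a, return_inverse=True, return_counts=True)
    cnt_mask = unq_cnt > 1
    dup_ids = unq[cnt_mask]
    cnt_idx, = np.nonzero(cnt_mask)
    idx_mask = np.in1d(unq_idx, cnt_idx)
    idx_idx, = np.nonzero(idx_mask)
    srt_idx = np.argsort(unq_idx[idx_mask])
    dup_idx = np.split(idx_idx[srt_idx], np.cumsum(unq_cnt[cnt_mask])[:-1])
    return dup_ids, dup_idx

dup_ids, dup_idx = find_duplicates([1, 345, 23, 345, 90, 1])
# dup_ids -> array([  1, 345]); dup_idx -> [array([0, 5]), array([1, 3])]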

1 Comment

I'm much more comfortable with this answer and it doesn't appear to take much longer than the dictionary answer above. Thank you for your time.

5

There is scipy.stats.itemfreq which would give the frequency of each item:

>>> xs = np.array([1, 345, 23, 345, 90, 1])
>>> ifreq = sp.stats.itemfreq(xs)
>>> ifreq
array([[  1,   2],
       [ 23,   1],
       [ 90,   1],
       [345,   2]])
>>> [(xs == w).nonzero()[0] for w in ifreq[ifreq[:,1] > 1, 0]]
[array([0, 5]), array([1, 3])]
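
Note that scipy.stats.itemfreq has since been deprecated and removed in newer SciPy releases. If your SciPy no longer provides it, np.unique with return_counts gives the same frequency table; a rough equivalent of the snippet above (the ifreq and dup_inds names are chosen here for illustration):

import numpy as np

xs = np.array([1, 345, 23, 345, 90, 1])
vals, counts = np.unique(xs, return_counts=True)
ifreq = np.column_stack((vals, counts))        # same table as itemfreq produced above
dup_inds = [(xs == w).nonzero()[0] for w in vals[counts > 1]]
# dup_inds -> [array([0, 5]), array([1, 3])]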

1 Comment

I didn't know about this function. Thanks for bringing it to my attention.

3

Your current approach is O(N**2); use a dictionary to do it in O(N) time:

>>> from collections import defaultdict
>>> traced_descIDs = [1, 345, 23, 345, 90, 1]
>>> d = defaultdict(list)
>>> for i, x in enumerate(traced_descIDs):
...     d[x].append(i)
...     
>>> for k, v in d.items():
...     if len(v) == 1:
...         del d[k]
...         
>>> d
defaultdict(<type 'list'>, {1: [0, 5], 345: [1, 3]})

And to get the items and indices:

>>> from itertools import izip
>>> dupIds, dupInds = izip(*d.iteritems())
>>> dupIds, dupInds
((1, 345), ([0, 5], [1, 3]))

Note that if you want to preserve the order of items in dupIds, then use collections.OrderedDict and the dict.setdefault() method.
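
The snippet above is Python 2 (izip, iteritems, and deleting from the dict while iterating over it). A Python 3 sketch of the same idea, using the OrderedDict/setdefault variant mentioned above (on Python 3.7+ a plain dict would preserve insertion order as well):

from collections import OrderedDict

traced_descIDs = [1, 345, 23, 345, 90, 1]

d = OrderedDict()                              # keeps first-seen order of the IDs
for i, x in enumerate(traced_descIDs):
    d.setdefault(x, []).append(i)

dup = {k: v for k, v in d.items() if len(v) > 1}
dupIds, dupInds = list(dup), list(dup.values())
# dupIds -> [1, 345]; dupInds -> [[0, 5], [1, 3]]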

4 Comments

Personally I would prefer the numpy solution, but if you want to go this way, the standard library has you covered: from collections import Counter
Could you expand on exactly what isn't preserved by not using OrderedDict?
Note that this solution will create a lot of Python objects, hence memory usage will explode; that's why staying within numpy is probably preferable if you are dealing with large datasets.
@CarlM The output here could have been [345, 1], as dicts have no order. An OrderedDict will make sure the output is [1, 345].

2

td = np.array(traced_descIDs)
si = np.argsort(td)
td[si][np.append(False, np.diff(td[si]) == 0)]

That gives you:

array([  1, 345])

I haven't figured out the second part quite yet, but maybe this will be inspiration enough for you, or maybe I'll get back to it. :)
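
One possible way to finish the second part in the same spirit (a sketch, not the answerer's code): sort once, split the argsort order wherever the sorted value changes, and keep only the groups with more than one member.

import numpy as np

traced_descIDs = [1, 345, 23, 345, 90, 1]
td = np.array(traced_descIDs)

si = np.argsort(td, kind='stable')                   # stable sort keeps original order within ties
sorted_td = td[si]
boundaries = np.flatnonzero(np.diff(sorted_td)) + 1  # positions where a new value starts
groups = np.split(si, boundaries)                    # original indices, grouped by value

dupInds = [g for g in groups if len(g) > 1]
dupIds = [td[g[0]] for g in dupInds]
# dupIds -> [1, 345]; dupInds -> [array([0, 5]), array([1, 3])]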


0

A solution of the same vectorized efficiency as proposed by Jaime is embedded in the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
print(npi.group_by(traced_descIDs, np.arange(len(traced_descIDs))))

This gets us most of the way there; but if we also want to filter out singleton groups while avoiding any python loops and staying entirely vectorized, we can go a little lower level, and do:

g = npi.group_by(traced_descIDs)
unique = g.unique
idx = g.split_array_as_list(np.arange(len(traced_descIDs)))
duplicates = unique[g.count>1]
idx_duplicates = np.asarray(idx)[g.count>1]
print(duplicates, idx_duplicates)


0

np.unique for N dimensions

I had a similar problem with an ndarray in which I wanted to find which rows were duplicated.

x = np.arange(60).reshape(5,4,3)
x[1] = x[0]

Rows 0 and 1 should be duplicates along axis 0. I used np.unique with all of its return options, then used Jaime's method to locate the duplicates.

_,i,_,c = np.unique(x,1,1,1,axis=0)
x_dup = x[i[1<c]]

I request return_inverse unnecessarily here (it is discarded); only return_index and return_counts are needed. Here is the result:

>>> print(x_dup)
[[[ 0  1  2]
  [ 3  4  5]
  [ 6  7  8]
  [ 9 10 11]]]
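
For reference, the same call spelled out with keyword arguments may be easier to read; a sketch (return_inverse is simply omitted, since it is not needed):

import numpy as np

x = np.arange(60).reshape(5, 4, 3)
x[1] = x[0]                                    # make slices 0 and 1 identical along axis 0

_, idx, cnt = np.unique(x, return_index=True, return_counts=True, axis=0)
x_dup = x[idx[cnt > 1]]                        # first occurrence of each duplicated axis-0 slice
print(x_dup)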

