I think you have a very good idea here. Think about the underlying block of memory used to represent the data in points. We tell numpy to regard that block as an array of shape (10,2) with dtype int32 (32-bit integers), but it is almost costless to tell numpy to regard that same block of memory as an array of shape (10,1) with dtype c8 (64-bit complex): each pair of int32s is reinterpreted as one complex number, so equal rows become equal complex values.
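As a quick sanity check, the view really does share the original buffer rather than copying it (np.shares_memory requires a reasonably recent numpy):

import numpy as np
points = np.zeros((10, 2), dtype=np.int32)
cpoints = points.view('c8')   # reinterpret each int32 pair as one complex64
print(cpoints.shape)          # (10, 1): the last axis shrinks because the itemsize doubles
print(np.shares_memory(points, cpoints))  # True: same buffer, nothing was copied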
So the only real cost is calling np.unique, followed by another virtually costless call to view and reshape:
import numpy as np
np.random.seed(1)
points = np.random.randint(0, 5, (10,2))
print(points)
print(len(points))
yields
[[3 4]
 [0 1]
 [3 0]
 [0 1]
 [4 4]
 [1 2]
 [4 2]
 [4 3]
 [4 2]
 [4 2]]
10
while
cpoints = points.view('c8')                   # shape (10,1), complex64: one value per row
cpoints = np.unique(cpoints)                  # unique complex values == unique rows
points = cpoints.view('i4').reshape((-1,2))   # back to int32 pairs
print(points)
print(len(points))
yields
[[0 1]
 [1 2]
 [3 0]
 [3 4]
 [4 2]
 [4 3]
 [4 4]]
7
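Two caveats worth noting: the view trick requires points to be C-contiguous (numpy refuses to reinterpret the last axis otherwise, though np.ascontiguousarray can fix that), and on newer numpy (1.13+) you can skip the dtype gymnastics entirely with the axis argument of np.unique. A minimal sketch of the latter:

import numpy as np
np.random.seed(1)
points = np.random.randint(0, 5, (10, 2))
print(np.unique(points, axis=0))   # treats each row as one item; same 7 sorted rows as above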
If you don't need the result to be sorted, wim's method is faster (you might want to consider accepting his answer...):
import numpy as np

np.random.seed(1)
N = 10000
points = np.random.randint(0, 5, (N, 2))

def using_unique():
    cpoints = points.view('c8')
    cpoints = np.unique(cpoints)
    return cpoints.view('i4').reshape((-1, 2))

def using_set():
    return np.vstack([np.array(u) for u in set([tuple(p) for p in points])])
yields these benchmarks (with the code above saved as test.py):
% python -mtimeit -s'import test' 'test.using_set()'
100 loops, best of 3: 18.3 msec per loop
% python -mtimeit -s'import test' 'test.using_unique()'
10 loops, best of 3: 40.6 msec per loop
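If you prefer timing from within Python rather than the shell, the timeit module gives equivalent numbers (assuming, as above, the functions live in test.py; absolute times will of course vary by machine and numpy version):

import timeit
print(timeit.timeit('test.using_set()', setup='import test', number=100))
print(timeit.timeit('test.using_unique()', setup='import test', number=100))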