
I have an (N, 3) numpy array:

>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> vals
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 7],
       [0, 4, 5],
       [2, 2, 1],
       [0, 0, 0],
       [5, 4, 3]])

I'd like to remove rows from the array that have a duplicate value. For example, the result for the above array should be:

>>> duplicates_removed
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])

I'm not sure how to do this efficiently with numpy without looping (the array could be quite large). Anyone know how I could do this?

  • By "without looping" what do you mean? You've got to check every item in the array, so it's O(m*n) no matter what tricks you use to hide the loop. Commented Sep 15, 2011 at 23:14
  • I think he means looping in Numpy rather than looping in Python. O(mn) inside a compiled Numpy function is much faster than O(mn) in a Python for loop. When the options are compiled code and interpreted code, constants matter. Commented Jun 18, 2014 at 16:17
  • From your comments, since you were looking to generalize this to handle a generic number of columns, you might find the solution to this question worth a read. Commented Jul 17, 2017 at 5:47

5 Answers


This is an option:

import numpy
vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
# True for rows where any pair of the three columns is equal
a = (vals[:,0] == vals[:,1]) | (vals[:,1] == vals[:,2]) | (vals[:,0] == vals[:,2])
# drop those rows
vals = numpy.delete(vals, numpy.where(a), axis=0)

4 Comments

I was trying to work this out, good job. But don't you need | not ^ ?
This is much faster than the list comprehension methods, so I'll probably accept. Wondering if there is any way to generalize to NxM though?
@Ned Batchelder: yes, although it doesn't change anything in this case.
@jterrace You could generalize by generating the combinations of 0-m, using them in a generator expression to make the comparisons, then reducing by | to get a.
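
A hedged sketch of that generalization (not from the answer above; the helper name and the use of itertools/functools are illustrative choices):

import numpy as np
from itertools import combinations
from functools import reduce

def remove_dup_rows(vals):
    # OR together an equality test for every pair of columns:
    # a row is dropped if any two of its values are equal.
    m = vals.shape[1]
    mask = reduce(np.logical_or,
                  (vals[:, i] == vals[:, j] for i, j in combinations(range(m), 2)))
    return np.delete(vals, np.where(mask)[0], axis=0)

vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
print(remove_dup_rows(vals))
# [[1 2 3]
#  [4 5 6]
#  [0 4 5]
#  [5 4 3]]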

Here's an approach that handles a generic number of columns and is still vectorized -

import numpy as np

def rows_uniq_elems(a):
    # sort each row, then keep rows whose consecutive sorted elements all differ
    a_sorted = np.sort(a, axis=-1)
    return a[(a_sorted[..., 1:] != a_sorted[..., :-1]).all(-1)]

Steps:

  • Sort along each row.

  • Compare consecutive elements in each sorted row. Any row with at least one zero difference contains a duplicate element, which gives us a mask of valid rows. The final step is simply to select the valid rows from the input array using that mask, as illustrated below.
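
To make the mask concrete, here is a hedged walk-through of the intermediate values for the question's (N, 3) array (not part of the original answer):

import numpy as np

a = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])

# Step 1: sort each row so duplicates become adjacent.
a_sorted = np.sort(a, axis=-1)

# Step 2: rows whose consecutive sorted elements all differ have no duplicates.
mask = (a_sorted[..., 1:] != a_sorted[..., :-1]).all(-1)
print(mask)      # [ True  True False  True False False  True]

# Step 3: keep only the valid rows.
print(a[mask])
# [[1 2 3]
#  [4 5 6]
#  [0 4 5]
#  [5 4 3]]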

Sample run -

In [49]: a
Out[49]: 
array([[1, 2, 3, 7],
       [4, 5, 6, 7],
       [7, 8, 7, 8],
       [0, 4, 5, 6],
       [2, 2, 1, 1],
       [0, 0, 0, 3],
       [5, 4, 3, 2]])

In [50]: rows_uniq_elems(a)
Out[50]: 
array([[1, 2, 3, 7],
       [4, 5, 6, 7],
       [0, 4, 5, 6],
       [5, 4, 3, 2]])

4 Comments

Out of interest, is np.sort(a) equivalent to a[np.arange(idx.shape[0])[:,None], idx]?
@EBB Not sure why I was going that indirect way. Updated with that sorting. Thanks for the suggestion!
Great, thanks! I was actually reading your answer again just as you uploaded! Freaky! Is ... the same as : in your slicing operation? I haven't seen that notation before. I'm also curious whether there is a difference between using axis=-1 and axis=1; for my problem both return the same answer. Is there a specific reason for choosing axis=-1 in your solution? Thanks for your help!
@EBB That's just a bit more generic, as it handles arrays of any number of dimensions when removing rows. So any 2D, 3D, etc. array would work now.
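
As a hedged illustration of that point (not from the original answer): for a 2D input the two spellings agree, while the ellipsis/axis=-1 form also accepts higher-dimensional input:

import numpy as np

a = np.array([[1, 2, 3], [7, 8, 7]])

# For 2D input, axis=-1 is the same as axis=1, and `...` expands to `:`.
assert np.array_equal(np.sort(a, axis=-1), np.sort(a, axis=1))
assert np.array_equal(np.sort(a, axis=-1)[..., 1:], np.sort(a, axis=1)[:, 1:])

# For 3D input, the same expression still tests uniqueness along the last
# axis; the mask then has one entry per row in the leading dimensions.
b = np.array([[[1, 2, 3], [4, 4, 5]],
              [[6, 7, 8], [9, 9, 9]]])
b_sorted = np.sort(b, axis=-1)
mask = (b_sorted[..., 1:] != b_sorted[..., :-1]).all(-1)
print(mask)
# [[ True False]
#  [ True False]]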
numpy.array([v for v in vals if len(set(v)) == len(v)])

Mind you, this still loops behind the scenes. You can't avoid that. But it should work fine even for millions of rows.

2 Comments

I came up with [item for item in vals if Counter(item).most_common(1)[0][1] == 1] but that's nicer, especially since you already know len(v). You're still "looping" in that you're iterating over the array, however.
This is actually surprisingly fast for a large array though, although I need the index locations of the duplicates, so I like @Benjamin's solution

It's six years on, but this question helped me, so I ran a speed comparison of the answers given by Divakar, Benjamin, Marcelo Cantos and Curtis Patrick.

import numpy as np
vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])

def rows_uniq_elems1(a):
    idx = a.argsort(1)
    a_sorted = a[np.arange(idx.shape[0])[:,None], idx]
    return a[(a_sorted[:,1:] != a_sorted[:,:-1]).all(-1)]

def rows_uniq_elems2(a):
    # mask of rows where any pair of the three columns is equal
    mask = (a[:,0] == a[:,1]) | (a[:,1] == a[:,2]) | (a[:,0] == a[:,2])
    return np.delete(a, np.where(mask), axis=0)

def rows_uniq_elems3(a):
    return np.array([v for v in a if len(set(v)) == len(v)])

def rows_uniq_elems4(a):
    return np.array([v for v in a if len(np.unique(v)) == len(v)])

Results:

%timeit rows_uniq_elems1(vals)
10000 loops, best of 3: 67.9 µs per loop

%timeit rows_uniq_elems2(vals)
10000 loops, best of 3: 156 µs per loop

%timeit rows_uniq_elems3(vals)
1000 loops, best of 3: 59.5 µs per loop

%timeit rows_uniq_elems4(vals)
10000 loops, best of 3: 268 µs per loop

It seems that using set beats numpy.unique. In my case I needed to do this over a much larger array:

bigvals = np.random.randint(0,10,3000).reshape([1000,3])

%timeit rows_uniq_elems1(bigvals)
10000 loops, best of 3: 276 µs per loop

%timeit rows_uniq_elems2(bigvals)
10000 loops, best of 3: 192 µs per loop

%timeit rows_uniq_elems3(bigvals)
10000 loops, best of 3: 6.5 ms per loop

%timeit rows_uniq_elems4(bigvals)
10000 loops, best of 3: 35.7 ms per loop

The methods without list comprehensions are much faster. However, the number of columns is hard-coded in rows_uniq_elems2, making it difficult to extend to more than three columns, so in my case at least the list comprehension with set is the best answer.

EDITED because I confused rows and columns in bigvals



Identical to Marcelo, but I think using numpy.unique() instead of set() may get across exactly what you are shooting for.

numpy.array([v for v in vals if len(numpy.unique(v)) == len(v)])

2 Comments

Well, set also gets across the same intent, but is numpy.unique faster, perhaps?
It actually seems to be much slower - 23 seconds for numpy.unique() vs. 3 seconds for set() on my machine with 1 million rows
