
I have an (N, 3) numpy array:

>>> vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
>>> vals
array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 7],
       [0, 4, 5],
       [2, 2, 1],
       [0, 0, 0],
       [5, 4, 3]])

I'd like to remove rows from the array that have a duplicate value. For example, the result for the above array should be:

>>> duplicates_removed
array([[1, 2, 3],
       [4, 5, 6],
       [0, 4, 5],
       [5, 4, 3]])

I'm not sure how to do this efficiently with numpy without looping (the array could be quite large). Anyone know how I could do this?

  • By "without looping" what do you mean? You've got to check every item in the array, so it's O(m*n) no matter what tricks you use to hide the loop. Commented Sep 15, 2011 at 23:14
  • I think he means looping in Numpy rather than looping in Python. O(mn) inside a compiled Numpy function is much faster than O(mn) in a Python for loop. When the options are compiled code and interpreted code, constants matter. Commented Jun 18, 2014 at 16:17
  • From your comments, since you were looking to generalize this to handle a generic number of columns, you might find the solution to this question worth a read. Commented Jul 17, 2017 at 5:47

5 Answers


This is an option:

import numpy
vals = numpy.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
# True for rows where any pair of the three columns is equal
a = (vals[:,0] == vals[:,1]) | (vals[:,1] == vals[:,2]) | (vals[:,0] == vals[:,2])
# drop those rows
vals = numpy.delete(vals, numpy.where(a), axis=0)

4 Comments

I was trying to work this out, good job. But don't you need | not ^ ?
This is much faster than the list comprehension methods, so I'll probably accept. Wondering if there is any way to generalize to NxM though?
@Ned Batchelder: yes, although it doesn't change anything in this case.
@jterrace You could generalize by generating the combinations of 0-m, using them in a generator expression to make the comparisons, then reducing by | to get a.
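
A hedged sketch of that generalization (not from the answer above; the helper name and the use of itertools/functools are illustrative choices):

import numpy as np
from itertools import combinations
from functools import reduce

def remove_dup_rows(vals):
    # OR together an equality test for every pair of columns:
    # a row is dropped if any two of its values are equal.
    m = vals.shape[1]
    mask = reduce(np.logical_or,
                  (vals[:, i] == vals[:, j] for i, j in combinations(range(m), 2)))
    return np.delete(vals, np.where(mask)[0], axis=0)

vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])
print(remove_dup_rows(vals))
# [[1 2 3]
#  [4 5 6]
#  [0 4 5]
#  [5 4 3]]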

Here's an approach that handles a generic number of columns and is still vectorized -

import numpy as np

def rows_uniq_elems(a):
    # sort each row, then keep rows whose consecutive sorted elements all differ
    a_sorted = np.sort(a, axis=-1)
    return a[(a_sorted[..., 1:] != a_sorted[..., :-1]).all(-1)]

Steps:

  • Sort along each row.

  • Compare consecutive elements in each sorted row. Any row with at least one zero difference contains a duplicate element, which gives us a mask of valid rows. The final step is simply to select the valid rows from the input array using that mask, as illustrated below.
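
To make the mask concrete, here is a hedged walk-through of the intermediate values for the question's (N, 3) array (not part of the original answer):

import numpy as np

a = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])

# Step 1: sort each row so duplicates become adjacent.
a_sorted = np.sort(a, axis=-1)

# Step 2: rows whose consecutive sorted elements all differ have no duplicates.
mask = (a_sorted[..., 1:] != a_sorted[..., :-1]).all(-1)
print(mask)      # [ True  True False  True False False  True]

# Step 3: keep only the valid rows.
print(a[mask])
# [[1 2 3]
#  [4 5 6]
#  [0 4 5]
#  [5 4 3]]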

Sample run -

In [49]: a
Out[49]: 
array([[1, 2, 3, 7],
       [4, 5, 6, 7],
       [7, 8, 7, 8],
       [0, 4, 5, 6],
       [2, 2, 1, 1],
       [0, 0, 0, 3],
       [5, 4, 3, 2]])

In [50]: rows_uniq_elems(a)
Out[50]: 
array([[1, 2, 3, 7],
       [4, 5, 6, 7],
       [0, 4, 5, 6],
       [5, 4, 3, 2]])

4 Comments

Out of interest, is np.sort(a) equivalent to a[np.arange(idx.shape[0])[:,None], idx]?
@EBB Not sure why I was going that indirect way. Updated with that sorting. Thanks for the suggestion!
Great, thanks! I was actually reading your answer again just as you uploaded! Freaky! Is ... the same as : in your slicing operation? I haven't seen that notation before. I'm also curious whether there is a difference between using axis=-1 and axis=1; for my problem both return the same answer. Is there a specific reason for choosing axis=-1 in your solution? Thanks for your help!
@EBB That's just a bit more generic, as it handles arrays of any number of dimensions when removing rows. So any 2D, 3D, etc. array would work now.
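
As a hedged illustration of that point (not from the original answer): for a 2D input the two spellings agree, while the ellipsis/axis=-1 form also accepts higher-dimensional input:

import numpy as np

a = np.array([[1, 2, 3], [7, 8, 7]])

# For 2D input, axis=-1 is the same as axis=1, and `...` expands to `:`.
assert np.array_equal(np.sort(a, axis=-1), np.sort(a, axis=1))
assert np.array_equal(np.sort(a, axis=-1)[..., 1:], np.sort(a, axis=1)[:, 1:])

# For 3D input, the same expression still tests uniqueness along the last
# axis; the mask then has one entry per row in the leading dimensions.
b = np.array([[[1, 2, 3], [4, 4, 5]],
              [[6, 7, 8], [9, 9, 9]]])
b_sorted = np.sort(b, axis=-1)
mask = (b_sorted[..., 1:] != b_sorted[..., :-1]).all(-1)
print(mask)
# [[ True False]
#  [ True False]]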
numpy.array([v for v in vals if len(set(v)) == len(v)])

Mind you, this still loops behind the scenes. You can't avoid that. But it should work fine even for millions of rows.

2 Comments

I came up with [item for item in vals if Counter(item).most_common(1)[0][1] == 1] but that's nicer, especially since you already know len(v). You're still "looping" in that you're iterating over the array, however.
This is actually surprisingly fast for a large array though, although I need the index locations of the duplicates, so I like @Benjamin's solution

It's six years on, but this question helped me, so I ran a speed comparison of the answers given by Divakar, Benjamin, Marcelo Cantos and Curtis Patrick.

import numpy as np
vals = np.array([[1,2,3],[4,5,6],[7,8,7],[0,4,5],[2,2,1],[0,0,0],[5,4,3]])

def rows_uniq_elems1(a):
    idx = a.argsort(1)
    a_sorted = a[np.arange(idx.shape[0])[:,None], idx]
    return a[(a_sorted[:,1:] != a_sorted[:,:-1]).all(-1)]

def rows_uniq_elems2(a):
    # mask of rows where any pair of the three columns is equal
    mask = (a[:,0] == a[:,1]) | (a[:,1] == a[:,2]) | (a[:,0] == a[:,2])
    return np.delete(a, np.where(mask), axis=0)

def rows_uniq_elems3(a):
    return np.array([v for v in a if len(set(v)) == len(v)])

def rows_uniq_elems4(a):
    return np.array([v for v in a if len(np.unique(v)) == len(v)])

Results:

%timeit rows_uniq_elems1(vals)
10000 loops, best of 3: 67.9 µs per loop

%timeit rows_uniq_elems2(vals)
10000 loops, best of 3: 156 µs per loop

%timeit rows_uniq_elems3(vals)
1000 loops, best of 3: 59.5 µs per loop

%timeit rows_uniq_elems4(vals)
10000 loops, best of 3: 268 µs per loop

It seems that using set beats numpy.unique. In my case I needed to do this over a much larger array:

bigvals = np.random.randint(0,10,3000).reshape([1000,3])

%timeit rows_uniq_elems1(bigvals)
10000 loops, best of 3: 276 µs per loop

%timeit rows_uniq_elems2(bigvals)
10000 loops, best of 3: 192 µs per loop

%timeit rows_uniq_elems3(bigvals)
10000 loops, best of 3: 6.5 ms per loop

%timeit rows_uniq_elems4(bigvals)
10000 loops, best of 3: 35.7 ms per loop

The methods without list comprehensions are much faster. However, the number of columns is hard-coded in rows_uniq_elems2, making it difficult to extend to more than three columns, so in my case at least the list comprehension with set is the best answer.

EDITED because I confused rows and columns in bigvals



Identical to Marcelo, but I think using numpy.unique() instead of set() may get across exactly what you are shooting for.

numpy.array([v for v in vals if len(numpy.unique(v)) == len(v)])

2 Comments

Well, set also gets across the same intent, but is numpy.unique faster, perhaps?
It actually seems to be much slower - 23 seconds for numpy.unique() vs. 3 seconds for set() on my machine with 1 million rows
