
Let's say I have a numpy array of the form

x = np.array([[2, 5],
              [3, 4],
              [1, 3],
              [2, 5],
              [4, 5],
              [1, 3],
              [1, 4],
              [3, 4]])

What I would like to get from this is an array which contains only the rows which are NOT duplicates, i.e., I expect from this example

array([[4, 5],
       [1, 4]])

I'm looking for a method which is reasonably fast and scales well. The only way that I can think to do this is

  1. First find the set of unique rows in x, as a new array y.
  2. Create a new array z which has those individual elements of y removed from x, thus z is a list of the duplicated rows in x.
  3. Do a set difference between x and z.

This seems horribly inefficient though. Anyone have a better way?

If it is important, I'm guaranteed that each of my rows will be sorted smallest to largest so that you'll never have a row be [5, 2] or [3, 1].
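
For concreteness, here is one way the three steps above could be read, as a plain-Python sketch (deliberately naive, treating each row as a tuple):

rows = [tuple(r) for r in x]
y = set(rows)                                    # 1. unique rows
z = [r for r in y if rows.count(r) > 1]          # 2. rows that appear more than once
out = np.array([r for r in rows if r not in z])  # 3. drop every occurrence of a duplicated row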

  • Why do you think this is inefficient? With a hash table this should be an O(n) algorithm, which is pretty reasonable. You cannot do much better, since you have to look at each element. Commented Apr 29, 2016 at 20:50
  • I expect it to be inefficient due to having to perform the loops myself. I know of no native numpy method that will perform step 2. Commented Apr 29, 2016 at 20:59
  • Oh, I see, but I don't think either numpy or pandas optimizes these steps with C code; you might also want to compare the running time against your own loop. Commented Apr 29, 2016 at 21:11
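
As a concrete illustration of the hash-table idea from the first comment, a single pass with collections.Counter (hashing each row as a tuple) already gives the O(n) behaviour:

from collections import Counter

counts = Counter(map(tuple, x))                               # one counting pass over the rows
out = np.array([r for r in map(tuple, x) if counts[r] == 1])  # keep rows seen exactly once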

5 Answers


Approach #1

Here's an approach based on np.unique that treats each row as an indexing tuple for efficiency (assuming that the input array has integers) -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)

# See which counts are exactly 1 and select the corresponding unique indices 
# and thus the corresponding rows from input as the final output
out = x[idx[count==1]]
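
For the sample x from the question, the intermediates would look like this (np.unique sorts the linearized values, so the surviving rows may come out in a different order than shown in the question):

lid    # array([17, 22,  9, 17, 29,  9, 10, 22])
count  # array([2, 1, 2, 2, 1])
out    # array([[1, 4],
       #        [4, 5]])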

Note: If there is a huge number of columns in the input array, you might want to get the linear indices lid manually, for which you can use np.cumprod, like so -

lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
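
A quick sanity check on the sample x that this manual computation (stored here as lid_manual just for comparison) matches np.ravel_multi_index:

lid_manual = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
np.array_equal(lid_manual, np.ravel_multi_index(x.T,x.max(0)+1))  # True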

Approach #2

Here's an alternative one that offloads the counting task to np.bincount, which might be more efficient for such purposes -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)

# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]
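
For the sample x, the counting step would look like this:

tag_idx                # array([2, 3, 0, 2, 4, 0, 1, 3])
np.bincount(tag_idx)   # array([2, 1, 2, 2, 1]) -> occurrences of each unique row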

Approach #3

Here's a different approach that uses convolution to catch non-duplicated rows. The inlined comments should help explain the underlying idea. Here goes -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Store sorted indices for lid
sidx = lid.argsort()

# Append True at both ends of the differentiated version of the sorted lid
mask = np.hstack((True,np.diff(lid[sidx])!=0,True))

# Perform convolution on it. Thus, non-duplicate elements would have
# two consecutive True elements, which can be caught with a convolution
# kernel of [1,1]. Get the corresponding mask. 
# Index into sorted indices with it for final output
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
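
Tracing this on the sample x: a row that occurs exactly once is bracketed by two consecutive True values in mask, which the [1,1] kernel turns into a value of 2:

lid[sidx]                  # array([ 9,  9, 10, 17, 17, 22, 22, 29])
mask.astype(int)           # array([1, 0, 1, 1, 0, 1, 0, 1, 1])
np.convolve(mask,[1,1])    # array([1, 1, 1, 2, 1, 1, 1, 1, 2, 1])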

3 Comments

I like these methods. I'm actually finding that the first method you've got is faster than the second. Great work though!
@zephyr Ah I see. I think the performance difference between these two approaches depends on the input array format!
@zephyr Could you check out the just added third approach for testing? Thanks!

Here is a pandas approach:

import pandas as pd

pd.DataFrame(x.T).T.drop_duplicates(keep=False).values

#array([[4, 5],
#       [1, 4]])

3 Comments

Is the keep=False necessary? I get an unexpected keyword error.
Sometimes it can be so easy. I wished I'd thought of that 20 minutes ago :-(
be sure to update your pandas version! Otherwise, on older versions, you can use a two-liner with the mask df.duplicated() | df.duplicated(take_last=True) to remove all duplicates.
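
As a sketch of the mask idea from the last comment (written here with the modern keep= keyword rather than the deprecated take_last=):

import pandas as pd

df = pd.DataFrame(x)
dup = df.duplicated() | df.duplicated(keep='last')  # True for every occurrence of a duplicated row
out = df[~dup].to_numpy()                           # rows that occur exactly once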

One possibility (which requires a lot of memory for arrays with many elements) is to first compare every row with every other row and count how many elements they have in common:

b = np.sum(x[:, None, :] == x, axis=2)
b
array([[2, 0, 0, 2, 1, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 1, 2],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [2, 0, 0, 2, 1, 0, 0, 0],
       [1, 0, 0, 1, 2, 0, 0, 0],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [0, 1, 1, 0, 0, 1, 2, 1],
       [0, 2, 0, 0, 0, 0, 1, 2]])

This array shows, for each pair of rows, how many elements they have in common. The diagonal compares each row with itself, so it needs to be set to zero:

np.fill_diagonal(b, 0)
b
array([[0, 0, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 2],
       [0, 0, 0, 0, 0, 2, 1, 0],
       [2, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 0, 0, 0, 1, 0]])

Now let's find the maximum for each row:

c = np.max(b, axis=0)
c
array([2, 2, 2, 2, 1, 2, 1, 2])

and then we need to find the positions where this maximum is != 2 and use them to index the original array:

x[np.where([c != 2])[1]]
array([[4, 5],
       [1, 4]])

2 Comments

I like that this keeps everything in numpy. Can you tell me how the construct x[:, None, :] works? I've never seen that before.
It adds another dimension. So the result is 3d with an empty axis as second axis and the original second axis as third. Alternatively you can use np.expand_dims(x, axis=1) which is a bit more intuitive but more to write :-)
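
As a quick check of the equivalence mentioned in the comment:

np.array_equal(x[:, None, :], np.expand_dims(x, axis=1))  # True, both insert a length-1 second axis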

For completeness, see also item 78 in http://www.labri.fr/perso/nrougier/teaching/numpy.100/



This problem can be solved efficiently using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
x[npi.multiplicity(x) == 1]

Not only is this solution very readable, it is also very efficient and works with any number of columns or dtypes.
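
For reference, on NumPy 1.13+ (which added the axis keyword to np.unique) a similar pure-NumPy one-liner is possible, though the surviving rows come back in sorted order:

uniq, cnt = np.unique(x, axis=0, return_counts=True)
out = uniq[cnt == 1]  # array([[1, 4], [4, 5]])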

