
Let's say I have a numpy array of the form

x = np.array([[2, 5],
              [3, 4],
              [1, 3],
              [2, 5],
              [4, 5],
              [1, 3],
              [1, 4],
              [3, 4]])

What I would like to get from this is an array which contains only the rows which are NOT duplicates, i.e., I expect from this example

array([[4, 5],
       [1, 4]])

I'm looking for a method which is reasonably fast and scales well. The only way that I can think to do this is

  1. First find the set of unique rows in x, as a new array y.
  2. Create a new array z which has those individual elements of y removed from x, thus z is a list of the duplicated rows in x.
  3. Do a set difference between x and z.

This seems horribly inefficient though. Anyone have a better way?

If it is important, I'm guaranteed that each of my rows will be sorted smallest to largest so that you'll never have a row be [5, 2] or [3, 1].
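
For concreteness, here is one way the three steps above could be read, as a plain-Python sketch (deliberately naive, treating each row as a tuple):

rows = [tuple(r) for r in x]
y = set(rows)                                    # 1. unique rows
z = [r for r in y if rows.count(r) > 1]          # 2. rows that appear more than once
out = np.array([r for r in rows if r not in z])  # 3. drop every occurrence of a duplicated row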

  • Why do you think this is inefficient? With a hash table this should be an O(n) algorithm, which is pretty reasonable. You cannot do much better, since you have to look at each element. Commented Apr 29, 2016 at 20:50
  • I expect it to be inefficient due to having to perform the loops myself. I know of no native numpy method that will perform step 2. Commented Apr 29, 2016 at 20:59
  • Oh, I see, but I don't think either numpy or pandas optimizes these steps with C code; you might also want to compare the running time against your own loop. Commented Apr 29, 2016 at 21:11
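
As a concrete illustration of the hash-table idea from the first comment, a single pass with collections.Counter (hashing each row as a tuple) already gives the O(n) behaviour:

from collections import Counter

counts = Counter(map(tuple, x))                               # one counting pass over the rows
out = np.array([r for r in map(tuple, x) if counts[r] == 1])  # keep rows seen exactly once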

5 Answers


Approach #1

Here's an approach based on np.unique that treats each row as an indexing tuple for efficiency (assuming that the input array has integers) -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get counts and unique indices
_,idx,count = np.unique(lid,return_index=True,return_counts=True)

# See which counts are exactly 1 and select the corresponding unique indices 
# and thus the corresponding rows from input as the final output
out = x[idx[count==1]]
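
For the sample x from the question, the intermediates would look like this (np.unique sorts the linearized values, so the surviving rows may come out in a different order than shown in the question):

lid    # array([17, 22,  9, 17, 29,  9, 10, 22])
count  # array([2, 1, 2, 2, 1])
out    # array([[1, 4],
       #        [4, 5]])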

Note: If there is a huge number of columns in the input array, you might want to get the linear indices lid manually, for which you can use np.cumprod, like so -

lid = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
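
A quick sanity check on the sample x that this manual computation (stored here as lid_manual just for comparison) matches np.ravel_multi_index:

lid_manual = x.dot(np.append(1,(x.max(0)+1)[::-1][:-1].cumprod())[::-1])
np.array_equal(lid_manual, np.ravel_multi_index(x.T,x.max(0)+1))  # True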

Approach #2

Here's an alternative one that offloads the counting task to np.bincount, which might be more efficient for such purposes -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Get unique indices and tagged indices for all elements
_,unq_idx,tag_idx = np.unique(lid,return_index=True,return_inverse=True)

# Use the tagged indices to count and look for count==1 and repeat like before
out = x[unq_idx[np.bincount(tag_idx)==1]]
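
For the sample x, the counting step would look like this:

tag_idx                # array([2, 3, 0, 2, 4, 0, 1, 3])
np.bincount(tag_idx)   # array([2, 1, 2, 2, 1]) -> occurrences of each unique row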

Approach #3

Here's a different approach that uses convolution to catch non-duplicated rows. The inlined comments should help explain the underlying idea. Here goes -

# Consider each row as indexing tuple & get linear indexing value             
lid = np.ravel_multi_index(x.T,x.max(0)+1)

# Store sorted indices for lid
sidx = lid.argsort()

# Append True at both ends of the differentiated version of the sorted lid
mask = np.hstack((True,np.diff(lid[sidx])!=0,True))

# Perform convolution on it. Thus, non-duplicate elements would have
# two consecutive True elements, which can be caught with a convolution
# kernel of [1,1]. Get the corresponding mask. 
# Index into sorted indices with it for final output
out = x[sidx[(np.convolve(mask,[1,1])>1)[1:-1]]]
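
Tracing this on the sample x: a row that occurs exactly once is bracketed by two consecutive True values in mask, which the [1,1] kernel turns into a value of 2:

lid[sidx]                  # array([ 9,  9, 10, 17, 17, 22, 22, 29])
mask.astype(int)           # array([1, 0, 1, 1, 0, 1, 0, 1, 1])
np.convolve(mask,[1,1])    # array([1, 1, 1, 2, 1, 1, 1, 1, 2, 1])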

3 Comments

I like these methods. I'm actually finding that the first method you've got is faster than the second. Great work though!
@zephyr Ah I see. I think the performance difference between these two approaches depends on the input array format!
@zephyr Could you check out the just added third approach for testing? Thanks!

Here is a pandas approach:

import pandas as pd

pd.DataFrame(x.T).T.drop_duplicates(keep=False).values

#array([[4, 5],
#       [1, 4]])

3 Comments

Is the keep=False necessary? I get an unexpected keyword error.
Sometimes it can be so easy. I wished I'd thought of that 20 minutes ago :-(
be sure to update your pandas version! Otherwise, on older versions, you can use a two-liner with the mask df.duplicated() | df.duplicated(take_last=True) to remove all duplicates.
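
As a sketch of the mask idea from the last comment (written here with the modern keep= keyword rather than the deprecated take_last=):

import pandas as pd

df = pd.DataFrame(x)
dup = df.duplicated() | df.duplicated(keep='last')  # True for every occurrence of a duplicated row
out = df[~dup].to_numpy()                           # rows that occur exactly once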

One possibility (which requires a lot of memory for arrays with many elements) is to first compare every row with every other row and count how many elements they have in common:

b = np.sum(x[:, None, :] == x, axis=2)
b
array([[2, 0, 0, 2, 1, 0, 0, 0],
       [0, 2, 0, 0, 0, 0, 1, 2],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [2, 0, 0, 2, 1, 0, 0, 0],
       [1, 0, 0, 1, 2, 0, 0, 0],
       [0, 0, 2, 0, 0, 2, 1, 0],
       [0, 1, 1, 0, 0, 1, 2, 1],
       [0, 2, 0, 0, 0, 0, 1, 2]])

This array shows, for each pair of rows, how many elements they have in common. The diagonal compares each row with itself, so it needs to be set to zero:

np.fill_diagonal(b, 0)
b
array([[0, 0, 0, 2, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1, 2],
       [0, 0, 0, 0, 0, 2, 1, 0],
       [2, 0, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 0],
       [0, 0, 2, 0, 0, 0, 1, 0],
       [0, 1, 1, 0, 0, 1, 0, 1],
       [0, 2, 0, 0, 0, 0, 1, 0]])

Now let's find the maximum for each row:

c = np.max(b, axis=0)
c
array([2, 2, 2, 2, 1, 2, 1, 2])

and then we need to find the positions where this maximum is != 2 and use them to index the original array:

x[np.where([c != 2])[1]]
array([[4, 5],
       [1, 4]])

2 Comments

I like that this keeps everything in numpy. Can you tell me how the construct x[:, None, :] works? I've never seen that before.
It adds another dimension. So the result is 3d with an empty axis as second axis and the original second axis as third. Alternatively you can use np.expand_dims(x, axis=1) which is a bit more intuitive but more to write :-)
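
As a quick check of the equivalence mentioned in the comment:

np.array_equal(x[:, None, :], np.expand_dims(x, axis=1))  # True, both insert a length-1 second axis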

For completeness, see also item 78 in http://www.labri.fr/perso/nrougier/teaching/numpy.100/



This problem can be solved efficiently using the numpy_indexed package (disclaimer: I am its author):

import numpy_indexed as npi
x[npi.multiplicity(x) == 1]

Not only is this solution very readable, it is also very efficient and works with any number of columns or dtypes.
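
For reference, on NumPy 1.13+ (which added the axis keyword to np.unique) a similar pure-NumPy one-liner is possible, though the surviving rows come back in sorted order:

uniq, cnt = np.unique(x, axis=0, return_counts=True)
out = uniq[cnt == 1]  # array([[1, 4], [4, 5]])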

