79

How can I remove duplicate rows of a 2 dimensional numpy array?

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

The answer should be as follows:

ans = array([[1,8,3,3,4],
             [1,8,9,9,4]])

If there are two rows that are the same, then I would like to remove one "duplicate" row.

4
  • 1
    Is it okay if the rows are not in the order originally present in the input array? Commented Jun 28, 2015 at 7:39
  • 1
    yes, order is not important Commented Jun 28, 2015 at 7:40
  • 1
    My problem is very similar to yours. [Look here][1] [1]: stackoverflow.com/questions/31093261/… Commented Jun 28, 2015 at 7:42
  • 3
    I believe now you can apply np.unique over an axis, so np.unique(data, axis = 0) works. Commented Jan 30, 2018 at 20:57

3 Answers

105

You can use np.unique. Because you want unique rows rather than unique elements, each row has to be treated as a single item, for example by converting it to a tuple:

import numpy as np

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])

Applying np.unique directly to the data array flattens it and returns the unique elements:

>>> uniques = np.unique(data)
>>> uniques
array([1, 3, 4, 8, 9])

Converting each row to a tuple first gives:

new_array = [tuple(row) for row in data]
uniques = np.unique(new_array)

which prints:

>>> uniques
array([[1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4]])

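UPDATE

In NumPy 1.13 and later, np.unique accepts an axis argument, so np.unique(data, axis=0) returns the unique rows directly. As noted in the comments, the tuple-based version above can return the flattened unique elements instead of rows (reported with NumPy 1.9.2), so prefer the axis argument.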
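
A minimal sketch of the axis=0 approach, assuming NumPy 1.13 or later. The variable names are only illustrative. Since the question's comments mention row order, it also shows one way to keep the rows in order of first appearance, using the return_index option of np.unique:

```python
import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4]])

# Unique rows, returned in sorted (lexicographic) order.
unique_rows = np.unique(data, axis=0)

# To keep first-occurrence order instead, ask for the index of each
# unique row's first appearance and sort those indices.
_, first_idx = np.unique(data, axis=0, return_index=True)
unique_rows_in_order = data[np.sort(first_idx)]

print(unique_rows)
print(unique_rows_in_order)
```

For this input both results contain the same two rows; they differ only when the sorted order and the original order disagree.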


10 Comments

I tried new_array = [tuple(row) for row in data] uniques = np.unique(new_array) but uniques is still array([1, 3, 4, 8, 9]) @ThePredator
Here is the code; I used the same code as you show: import numpy as np data = np.array([[1,8,3,3,4], [1,8,9,9,4], [1,8,3,3,4]]) new_array = [tuple(row) for row in data] uniques = np.unique(new_array) uniques Out[30]: array([1, 3, 4, 8, 9]) Does this depend on the numpy version? My numpy version is 1.9.2
I think the following is the right answer stackoverflow.com/questions/16970982/…
In the new version, you need to set np.unique(data, axis=0)
Note that Divakar's lexsort solution is still the fastest presented here (at least for this example).
28

One approach with lex-sorting -

# Perform lex sort and get sorted data
sorted_idx = np.lexsort(data.T)
sorted_data =  data[sorted_idx,:]

# Get unique row mask
row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))

# Get unique rows
out = sorted_data[row_mask]

Sample run -

In [199]: data
Out[199]: 
array([[1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 0, 3, 4],
       [1, 8, 9, 9, 4]])

In [200]: sorted_idx = np.lexsort(data.T)
     ...: sorted_data =  data[sorted_idx,:]
     ...: row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))
     ...: out = sorted_data[row_mask]
     ...: 

In [201]: out
Out[201]: 
array([[1, 8, 0, 3, 4],
       [1, 8, 3, 3, 4],
       [1, 8, 9, 9, 4]])
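
The lex-sort steps above can be wrapped in a function. The following is a sketch, not the answer's exact code: the name unique_rows_lexsort is hypothetical, and it reverses the key order because np.lexsort treats the last key as the primary one. With the columns reversed, the rows come out sorted by the first column first, which matches the ordering of np.unique(data, axis=0). It also compares neighbouring rows with != rather than np.diff, which is an assumption made here so that boolean arrays work too.

```python
import numpy as np

def unique_rows_lexsort(data):
    """Return the unique rows of a 2-D array, sorted by column 0 first."""
    data = np.asarray(data)
    if data.shape[0] == 0:
        return data
    # np.lexsort uses the LAST key as the primary sort key, so the
    # columns are reversed to make column 0 the primary key.
    order = np.lexsort(data.T[::-1])
    sorted_data = data[order]
    # Keep the first row, plus every row that differs from its predecessor.
    keep = np.ones(len(sorted_data), dtype=bool)
    keep[1:] = np.any(sorted_data[1:] != sorted_data[:-1], axis=1)
    return sorted_data[keep]

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4],
                 [1, 8, 3, 3, 4],
                 [1, 8, 0, 3, 4],
                 [1, 8, 9, 9, 4]])

out = unique_rows_lexsort(data)
print(out)
```

Duplicates are adjacent after sorting, so a single pass over neighbouring rows is enough to find them.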

Runtime tests -

This section times all approaches proposed in the solutions presented thus far.

In [34]: data = np.random.randint(0,10,(10000,10))

In [35]: def tuple_based(data):
    ...:     new_array = [tuple(row) for row in data]
    ...:     return np.unique(new_array)
    ...: 
    ...: def lexsort_based(data):                 
    ...:     sorted_data =  data[np.lexsort(data.T),:]
    ...:     row_mask = np.append([True],np.any(np.diff(sorted_data,axis=0),1))
    ...:     return sorted_data[row_mask]
    ...: 
    ...: def unique_based(a):
    ...:     a = np.ascontiguousarray(a)
    ...:     unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    ...:     return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))
    ...: 

In [36]: %timeit tuple_based(data)
10 loops, best of 3: 63.1 ms per loop

In [37]: %timeit lexsort_based(data)
100 loops, best of 3: 8.92 ms per loop

In [38]: %timeit unique_based(data)
10 loops, best of 3: 29.1 ms per loop
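
Timings like these depend on the NumPy version and the machine, so it may be worth re-running them locally. Below is a sketch of how that could be done outside IPython with the standard timeit module. The axis_based function is an addition for comparison (it is not part of the original test and assumes NumPy 1.13 or later), and the repeat counts are arbitrary choices.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(10000, 10))

def lexsort_based(data):
    sorted_data = data[np.lexsort(data.T), :]
    row_mask = np.append([True], np.any(np.diff(sorted_data, axis=0), 1))
    return sorted_data[row_mask]

def unique_based(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)] * a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

def axis_based(data):
    # Added for comparison; requires NumPy >= 1.13.
    return np.unique(data, axis=0)

timings = {}
for func in (lexsort_based, unique_based, axis_based):
    # Best of 3 repeats, 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(data), number=10, repeat=3)) / 10
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```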

1 Comment

FYI: unique_based is about twice as fast as np.unique(data, axis=0), so lexsort is still preferable in 2022.

8

A simple solution can be:

import numpy as np
def unique_rows(a):
    a = np.ascontiguousarray(a)
    unique_a = np.unique(a.view([('', a.dtype)]*a.shape[1]))
    return unique_a.view(a.dtype).reshape((unique_a.shape[0], a.shape[1]))

data = np.array([[1,8,3,3,4],
                 [1,8,9,9,4],
                 [1,8,3,3,4]])


print(unique_rows(data))
# prints:
[[1 8 3 3 4]
 [1 8 9 9 4]]
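
To see why the view trick works, it may help to look at the intermediate shapes. This is a sketch with illustrative variable names: viewing a contiguous 2-D array with a structured dtype that has one field per column turns each row into a single record, so np.unique compares whole rows instead of individual elements.

```python
import numpy as np

data = np.array([[1, 8, 3, 3, 4],
                 [1, 8, 9, 9, 4],
                 [1, 8, 3, 3, 4]])

a = np.ascontiguousarray(data)

# One unnamed field per column; NumPy assigns default names f0, f1, ...
row_dtype = [('', a.dtype)] * a.shape[1]

# Each row of 5 numbers becomes one record: shape (3, 5) -> (3, 1).
rows_as_records = a.view(row_dtype)
print(rows_as_records.shape)

# np.unique now deduplicates records, i.e. whole rows.
unique_records = np.unique(rows_as_records)
print(unique_records.shape)

# View the records as plain numbers again and restore the 2-D shape.
result = unique_records.view(a.dtype).reshape(-1, a.shape[1])
print(result)
```

The call to np.ascontiguousarray matters because the view reinterprets the underlying memory, which only lines up with rows when the array is stored row by row.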

You can check this for many more solutions for this problem
