2

I have a NumPy array in Python with two columns, as follows:

time,id
1,a
2,b
3,a
1,a
5,c
6,b
3,a

I want to take the unique times of each user. For the above data, I want the output below.

time,id
1,a
2,b
3,a
5,c
6,b

That is, I want to take only unique rows, so 1,a and 3,a will not repeat in the result. Both columns have string dtype and the 2-D array is very large. One solution would be to iterate over all the rows and build a set, but that would be very slow. Please suggest an efficient way to implement this.
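
For reference, the set-based idea mentioned above would look roughly like this (a rough sketch only; arr stands in for the actual array):

import numpy as np

arr = np.array([['1', 'a'], ['2', 'b'], ['3', 'a'], ['1', 'a'],
                ['5', 'c'], ['6', 'b'], ['3', 'a']])

seen = set()
rows = []
for row in arr:
    key = tuple(row)        # rows (arrays) are not hashable, tuples are
    if key not in seen:     # keep only the first occurrence
        seen.add(key)
        rows.append(row)
unique_rows = np.array(rows)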

5
  • Do you have a numpy array or pandas data frame? Commented Sep 16, 2016 at 23:22
  • Possible duplicate of Find unique rows in numpy.array Commented Sep 16, 2016 at 23:39
  • What's the shape and dtype of your array? Commented Sep 17, 2016 at 1:05
  • @hpaulj I mentioned in the first line that the array has 2 columns, and the data type (string) is also mentioned. Commented Sep 17, 2016 at 4:29
  • @Psidom It is clearly written that it is a NumPy array. Commented Sep 17, 2016 at 4:29

4 Answers

6

Given:

>>> b
[['1' 'a']
 ['2' 'b']
 ['3' 'a']
 ['1' 'a']
 ['5' 'c']
 ['6' 'b']
 ['3' 'a']]

You can do:

>>> np.vstack({tuple(e) for e in b})
[['3' 'a']
 ['1' 'a']
 ['2' 'b']
 ['6' 'b']
 ['5' 'c']]

Since that is a set comprehension, you lose the order of the original.

Or, to maintain order, you can do:

>>> c = np.ascontiguousarray(b).view(np.dtype((np.void, b.dtype.itemsize * b.shape[1])))
>>> b[np.unique(c, return_index=True)[1]]
[['1' 'a']
 ['2' 'b']
 ['3' 'a']
 ['5' 'c']
 ['6' 'b']]
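
Or, on newer NumPy versions (1.13 and later), np.unique accepts an axis argument, which avoids the void view entirely. Note that it returns the rows sorted, so re-index with the sorted first-occurrence indices if the original order matters (a sketch of the same idea):

>>> np.unique(b, axis=0)                               # unique rows, sorted
>>> idx = np.unique(b, axis=0, return_index=True)[1]
>>> b[np.sort(idx)]                                    # unique rows, original order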

Or, if you can use Pandas, this is really easy. Given the following DataFrame:

>>> df
  id  time
0  a     1
1  b     2
2  a     3
3  a     1
4  c     5
5  b     6
6  a     3

Just use drop_duplicates():

>>> df.drop_duplicates()
  id  time
0  a     1
1  b     2
2  a     3
4  c     5
5  b     6
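
If the data starts out as the NumPy array b shown above, the DataFrame can be built straight from it (the column names here are just assumed to match the question):

>>> import pandas as pd
>>> df = pd.DataFrame(b, columns=['time', 'id'])
>>> df.drop_duplicates().values    # back to a plain NumPy array, if needed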

1

If you go back to your original list format data and create a structured array, then determining the unique values is much easier.

a = [['1', 'a'], ['2', 'b'], ['3', 'a'],['1', 'a'],['5', 'c'], ['6', 'b'], ['3', 'a']]

tup = [tuple(i) for i in a]  # you need a list of tuples, a kludge for now

dt = [('f1', '<U5'), ('f2', '<U5')]  # specify a dtype with two columns

b = np.array(tup, dtype=dt)  # create the array with the dtype

np.unique(b)  # get the unique values
array([('1', 'a'), ('2', 'b'), ('3', 'a'), ('5', 'c'), ('6', 'b')], 
      dtype=[('f1', '<U5'), ('f2', '<U5')])

np.unique(b).tolist()  # and if you need a list, just convert the array
[('1', 'a'), ('2', 'b'), ('3', 'a'), ('5', 'c'), ('6', 'b')]

Reference: Find unique rows in numpy.array

A combination of Joe Kington's and Jaime's recommendations deals with views, and the above can be simplified to the following. Nicely, this option relies on a view: a change of dtype to a structured array, and a slice into the original array using the indices of the unique values in the structured view. Note that this requires a to be a NumPy array rather than the plain list from above:

>>> a = np.array(a)  # convert the list of lists to an ndarray first
>>> dt = a.dtype.descr * a.shape[1]
>>> a_view = a.view(dt)
>>> a_uniq, a_idx = np.unique(a_view, return_index=True)
>>> a[a_idx]
array([['1', 'a'],
       ['2', 'b'],
       ['3', 'a'],
       ['5', 'c'],
       ['6', 'b']], 
      dtype='<U1')


1

For future readers: a pure NumPy way to drop duplicates based on a specific column (or columns):

import numpy as np

x = np.array([[1, 'a'],
              [2, 'b'],
              [3, 'a'],
              [1, 'a'],
              [5, 'c'],
              [6, 'b'],
              [3, 'a']])

print(x[np.unique(x[:,0], axis=0, return_index=True)[1]])

[['1' 'a']
 ['2' 'b']
 ['3' 'a']
 ['5' 'c']
 ['6' 'b']]

or more than one column:

print(x[np.unique(x[:,[0, 1]], axis=0, return_index=True)[1]])
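
As a side note, if preserving the original row order does not matter, np.unique on the whole array (NumPy 1.13 and later) returns the unique rows directly, though in sorted order:

print(np.unique(x, axis=0))    # unique rows, but sorted rather than in input order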

3 Comments

This will drop duplicates based on a single column and NOT based on the values of both columns! (e.g. rows like [1, "a"] and [1, "b"] would be identified as duplicates...)
@raphael ??? That's literally what I said: "...to drop duplicates based on a specific...". This is just a generalized answer; do x[:, [0, 1]] if you need to consider both.
Whoa, sorry, I didn't realize that np.unique already considers all values in a row if you use axis=0... I added it to the answer and will undo the down-vote once it's accepted!
0

In case somebody still needs it, here's a one-liner :-)

Note that this requires the values within each column to have the same dtype!

import numpy as np
a = [[1, "a"], [1, "b"], [1, "c"], [2, "a"], [2, "b"], [2, "c"],
     [1, "a"], [1, "b"], [1, "c"], [2, "a"], [2, "b"], [2, "c"]]

unique_a = np.unique(np.rec.fromrecords(a)).tolist()
print(unique_a)
# [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c')]

