2

I have a NumPy array in Python with two columns, as follows:

time,id
1,a
2,b
3,a
1,a
5,c
6,b
3,a

I want to take the unique times of each user. For the above data, I want the output below.

time,id
1,a
2,b
3,a
5,c
6,b

That is, I want to take only unique rows, so 1,a and 3,a will not repeat in the result. Both columns have string dtype and the 2-D array is very large. One solution would be to iterate over all the rows and build a set, but that would be very slow. Please suggest an efficient way to implement this.
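
For reference, the set-based idea mentioned above would look roughly like this (a rough sketch only; arr stands in for the actual array):

import numpy as np

arr = np.array([['1', 'a'], ['2', 'b'], ['3', 'a'], ['1', 'a'],
                ['5', 'c'], ['6', 'b'], ['3', 'a']])

seen = set()
rows = []
for row in arr:
    key = tuple(row)        # rows (arrays) are not hashable, tuples are
    if key not in seen:     # keep only the first occurrence
        seen.add(key)
        rows.append(row)
unique_rows = np.array(rows)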

5
  • Do you have a numpy array or pandas data frame? Commented Sep 16, 2016 at 23:22
  • Possible duplicate of Find unique rows in numpy.array Commented Sep 16, 2016 at 23:39
  • What's the shape and dtype of your array? Commented Sep 17, 2016 at 1:05
  • @hpaulj I mentioned in the first line that the array has 2 columns, and the data type (string) is also mentioned. Commented Sep 17, 2016 at 4:29
  • @Psidom It is clearly written that it is a NumPy array. Commented Sep 17, 2016 at 4:29

4 Answers

6

Given:

>>> b
[['1' 'a']
 ['2' 'b']
 ['3' 'a']
 ['1' 'a']
 ['5' 'c']
 ['6' 'b']
 ['3' 'a']]

You can do:

>>> np.vstack({tuple(e) for e in b})
[['3' 'a']
 ['1' 'a']
 ['2' 'b']
 ['6' 'b']
 ['5' 'c']]

Since that is a set comprehension, you lose the order of the original.

Or, to maintain order, you can do:

>>> c = np.ascontiguousarray(b).view(np.dtype((np.void, b.dtype.itemsize * b.shape[1])))
>>> b[np.unique(c, return_index=True)[1]]
[['1' 'a']
 ['2' 'b']
 ['3' 'a']
 ['5' 'c']
 ['6' 'b']]
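
Or, on newer NumPy versions (1.13 and later), np.unique accepts an axis argument, which avoids the void view entirely. Note that it returns the rows sorted, so re-index with the sorted first-occurrence indices if the original order matters (a sketch of the same idea):

>>> np.unique(b, axis=0)                               # unique rows, sorted
>>> idx = np.unique(b, axis=0, return_index=True)[1]
>>> b[np.sort(idx)]                                    # unique rows, original order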

Or, if you can use Pandas, this is really easy. Given the following DataFrame:

>>> df
  id  time
0  a     1
1  b     2
2  a     3
3  a     1
4  c     5
5  b     6
6  a     3

Just use drop_duplicates():

>>> df.drop_duplicates()
  id  time
0  a     1
1  b     2
2  a     3
4  c     5
5  b     6
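
If the data starts out as the NumPy array b shown above, the DataFrame can be built straight from it (the column names here are just assumed to match the question):

>>> import pandas as pd
>>> df = pd.DataFrame(b, columns=['time', 'id'])
>>> df.drop_duplicates().values    # back to a plain NumPy array, if needed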

1

If you go back to your original list format data and create a structured array, then determining the unique values is much easier.

a = [['1', 'a'], ['2', 'b'], ['3', 'a'],['1', 'a'],['5', 'c'], ['6', 'b'], ['3', 'a']]

tup = [tuple(i) for i in a]  # you need a list of tuples, a kludge for now

dt = [('f1', '<U5'), ('f2', '<U5')]  # specify a dtype with two columns

b = np.array(tup, dtype=dt)  # create the array with the dtype

np.unique(b)  # get the unique values
array([('1', 'a'), ('2', 'b'), ('3', 'a'), ('5', 'c'), ('6', 'b')], 
      dtype=[('f1', '<U5'), ('f2', '<U5')])

np.unique(b).tolist()  # and if you need a list, just convert the array
[('1', 'a'), ('2', 'b'), ('3', 'a'), ('5', 'c'), ('6', 'b')]

Reference: Find unique rows in numpy.array

A combination of Joe Kington's and Jaime's recommendations deals with views, and the above can be simplified to the following. Nicely, this option relies on a view: a change of dtype to a structured array, and a slice into the original array using the indices of the unique values in the structured view. Note that this requires a to be a NumPy array rather than the plain list from above:

>>> a = np.array(a)  # convert the list of lists to an ndarray first
>>> dt = a.dtype.descr * a.shape[1]
>>> a_view = a.view(dt)
>>> a_uniq, a_idx = np.unique(a_view, return_index=True)
>>> a[a_idx]
array([['1', 'a'],
       ['2', 'b'],
       ['3', 'a'],
       ['5', 'c'],
       ['6', 'b']], 
      dtype='<U1')


1

For future readers: a pure NumPy way to drop duplicates based on a specific column (or columns):

import numpy as np

x = np.array([[1, 'a'],
              [2, 'b'],
              [3, 'a'],
              [1, 'a'],
              [5, 'c'],
              [6, 'b'],
              [3, 'a']])

print(x[np.unique(x[:,0], axis=0, return_index=True)[1]])

[['1' 'a']
 ['2' 'b']
 ['3' 'a']
 ['5' 'c']
 ['6' 'b']]

or more than one column:

print(x[np.unique(x[:,[0, 1]], axis=0, return_index=True)[1]])
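
As a side note, if preserving the original row order does not matter, np.unique on the whole array (NumPy 1.13 and later) returns the unique rows directly, though in sorted order:

print(np.unique(x, axis=0))    # unique rows, but sorted rather than in input order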

3 Comments

This will drop duplicates based on a single column and NOT based on the values of both columns! (e.g. rows like [1, "a"] and [1, "b"] would be identified as duplicates...)
@raphael ??? That's literally what I said: "...to drop duplicates based on a specific...". This is just a generalized answer; do x[:, [0, 1]] if you need to consider both.
Whoa, sorry, I didn't realize that np.unique already considers all values in a row if you use axis=0... I added it to the answer and will undo the down-vote once it's accepted!
0

In case somebody still needs it, here's a one-liner :-)

Note that this requires the values within each column to have the same dtype!

import numpy as np
a = [[1, "a"], [1, "b"], [1, "c"], [2, "a"], [2, "b"], [2, "c"],
     [1, "a"], [1, "b"], [1, "c"], [2, "a"], [2, "b"], [2, "c"]]

unique_a = np.unique(np.rec.fromrecords(a)).tolist()
print(unique_a)
# [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (2, 'b'), (2, 'c')]

