Getting Unique 1D NumPy Array Values without Sorting

Question

I have many large 1D arrays and I'd like to grab the unique values. Typically, one could do:

import numpy as np

x = np.random.randint(10000, size=100000000)
np.unique(x)

However, this performs an unnecessary sort of the array. The docs for np.unique do not mention any way to retrieve the indices without sorting. Other answers with np.unique include using return_index but, as I understand it, the array is still being sorted. So, I tried using set:

set(x)

But this is way slower than sorting the array with np.unique. Is there a faster way to retrieve the unique values for this array that avoids sorting and is faster than np.unique?
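For reference, the return_index variant mentioned above looks roughly like the sketch below. It recovers the unique values in order of first appearance, but np.unique still sorts internally, so it does not avoid the cost in question. The helper name unique_in_order is made up for illustration.

```python
import numpy as np

def unique_in_order(x):
    # np.unique sorts internally; return_index gives the position of the
    # first occurrence of each unique value in the original array.
    _, first_idx = np.unique(x, return_index=True)
    # Sorting those positions restores first-appearance order.
    return x[np.sort(first_idx)]

x = np.array([3, 1, 3, 2, 1, 0])
print(unique_in_order(x))  # [3 1 2 0]
```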

You can use pandas: pd.Series(x).unique(). Seems a bit faster.
– Divakar, Commented Feb 12, 2020 at 21:02
Sorting is needed to efficiently check for duplicates, especially when the arrays become larger, so I think the most efficient algorithm includes sorting 'under the hood'.
– BramAppel, Commented Feb 12, 2020 at 21:08
@Divakar I was hoping to keep this in NumPy-land in order to avoid adding an additional package dependency since I expect to open source the code. I want to make sure that the juice is worth the squeeze.
– slaw, Commented Feb 12, 2020 at 21:10
unique works by sorting, and then looking for adjacent matching values. Whether you ask for the index or not, it doesn't change the basic mechanism. set uses Python's hashing (which is also used for dict). Is there some other, more efficient, approach?
– hpaulj, Commented Feb 12, 2020 at 21:51
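
The mechanism hpaulj describes can be sketched in a few lines. This is a simplified illustration of the sort-then-compare-neighbours idea, not NumPy's actual source, and the name unique_by_sorting is hypothetical.

```python
import numpy as np

def unique_by_sorting(x):
    # Sort a copy so that equal values end up next to each other.
    s = np.sort(x)
    # Keep the first element plus every element that differs from its predecessor.
    keep = np.empty(s.shape, dtype=bool)
    keep[:1] = True
    keep[1:] = s[1:] != s[:-1]
    return s[keep]

x = np.array([3, 1, 3, 2, 1, 0])
print(unique_by_sorting(x))  # [0 1 2 3]
```

The sort dominates the cost here, which is why asking for return_index or not makes little difference to the running time.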

Alain T. · Accepted Answer · 2020-02-13 05:41:41Z

2

If your values are non-negative integers in a relatively small range (e.g. 0 ... 10000), there is an alternative way to obtain the unique values using a mask (see unique2() below):

import numpy as np

def unique1(x):
    return np.unique(x)

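def unique2(x):
    maxVal    = np.max(x)+1          # mask needs one slot per possible value
    values    = np.arange(maxVal)
    used      = np.zeros(maxVal)
    used[x]   = 1                    # flag every value that occurs in x
    return values[used==1]

# optimized (with option to provide known value range)
def unique3(x,maxVal=None):
    maxVal    = maxVal or np.max(x)+1   # maxVal is an exclusive upper bound
    used      = np.zeros(maxVal,dtype=np.uint8)
    used[x]   = 1
    return np.argwhere(used==1)[:,0]    # indices of flagged slots are the unique values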

In my tests this method is roughly 20x faster than np.unique (see the timings below), and it does not involve sorting:

from timeit import timeit
count = 3
x = np.random.randint(10000, size=100000000)

t = timeit(lambda:unique1(x),number=count)
print("unique1",t)

t = timeit(lambda:unique2(x),number=count)
print("unique2",t)

t = timeit(lambda:unique3(x),number=count)
print("unique3",t)

t = timeit(lambda:unique3(x,10000),number=count)
print("unique3",t, "with known value range")


# unique1 16.894681214000002
# unique2 0.8627655060000023
# unique3 0.8411087540000004
# unique3 0.5896318829999991 with known value range
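
A quick sanity check on a smaller array can confirm that the mask approach returns the same values as np.unique. The sketch below uses a compact variant named unique_mask (a hypothetical name) that swaps in np.flatnonzero for the argwhere call; it assumes a non-empty array of non-negative integers.

```python
import numpy as np

def unique_mask(x, max_val=None):
    # Assumes x holds non-negative integers; max_val is an exclusive upper bound.
    if max_val is None:
        max_val = int(x.max()) + 1
    used = np.zeros(max_val, dtype=np.uint8)
    used[x] = 1                    # mark every value that occurs
    return np.flatnonzero(used)    # indices of marked slots, in ascending order

rng = np.random.default_rng(0)
x = rng.integers(0, 1000, size=100_000)
assert np.array_equal(unique_mask(x), np.unique(x))
assert np.array_equal(unique_mask(x, 1000), np.unique(x))
```

The result comes out in ascending order even though no sort is performed, because the mask is scanned from index 0 upwards.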

edited Feb 13, 2020 at 5:41

answered Feb 12, 2020 at 23:22

Alain T.


3 Comments

o11c Over a year ago

You should create the zeroes with dtype=uint8. IIRC dtype=bool doesn't do anything useful.

o11c Over a year ago

Note also that gmpy2.xmpz can be used as a bitvector for better memory efficiency (it also avoids the temporaries), but it doesn't allow the values[array] trick, so it will cost a lot more CPU unless you're using something that JITs your loops.

Alain T. Over a year ago

I merely wanted to point out an alternative method with orders of magnitude improvements in speed. I agree that it can be further improved by a few % with low level optimizations (e.g. using uint8 shaves off 3%).
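
To put the dtype discussion in concrete terms, the sketch below compares the memory footprint of the mask under different dtypes. It only measures size; any speed difference between uint8 and bool is not tested here. The mask length of 10,000 is taken from the benchmark's value range.

```python
import numpy as np

n = 10_000  # mask length, i.e. max value + 1 in the benchmark above
for dtype in (np.float64, np.uint8, np.bool_):
    mask = np.zeros(n, dtype=dtype)
    print(np.dtype(dtype).name, mask.nbytes, "bytes")
# float64 80000 bytes
# uint8 10000 bytes
# bool 10000 bytes
```

np.zeros defaults to float64, so unique2 allocates eight times more mask memory than unique3. uint8 and bool both take one byte per element.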

Daniel F · Accepted Answer · 2020-02-13 10:55:07Z

0

Just in case you change your mind about dependencies, here's a dirt simple numba.njit implementation:

import numba
import numpy as np

@numba.njit
def unique(arr):
    return np.array(list(set(arr)))


%timeit unique(x) #using Alain T.'s benchmark array
2.64 s ± 799 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit np.unique(x)
5.45 s ± 233 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not as lightning fast as Alain T.'s answer above, but it doesn't require non-negative integer inputs, either.
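
If the inputs are integers that may be negative but still span a small range, the mask idea from the other answer could in principle be stretched by shifting the values so the minimum maps to slot 0. This is a sketch under that assumption, not something either answer proposes; the name unique_shifted is hypothetical, and it does not help with floats or widely spread values.

```python
import numpy as np

def unique_shifted(x):
    # Assumes a non-empty integer array whose span (max - min) is small
    # enough for a mask of that length to fit comfortably in memory.
    lo = int(x.min())
    hi = int(x.max())
    used = np.zeros(hi - lo + 1, dtype=np.uint8)
    used[x - lo] = 1                   # shift so the smallest value maps to slot 0
    return np.flatnonzero(used) + lo   # shift back to the original values

x = np.array([-3, 7, -3, 0, 7, -1])
print(unique_shifted(x).tolist())  # [-3, -1, 0, 7]
```

Note that x - lo allocates a temporary the same size as x, which matters for arrays as large as the one in the benchmark.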

edited Feb 13, 2020 at 10:55

answered Feb 13, 2020 at 10:38

Daniel F

