I am looking to speed up the following piece of code:
NNlist=[np.unique(i) for i in NNlist]
where NNlist is a list of np.arrays with duplicated entries.
Thanks :)
numpy.unique is already pretty optimized; you're not likely to get much of a speedup over what you already have unless you know something else about the underlying data. For example, if the data are all small integers you might be able to use numpy.bincount, or if the unique values in each of the arrays are mostly the same, there might be some optimization that could be done over the whole list of arrays.
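For instance, a minimal sketch of the numpy.bincount idea, assuming the values are small non-negative integers (my own illustration, not from the question):
import numpy as np
x = np.random.randint(10, 20, 1_000_000)
# nonzero bins of the value histogram are exactly the sorted unique values
uniques = np.nonzero(np.bincount(x))[0]
assert np.array_equal(uniques, np.unique(x))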
pandas.unique() is much faster than numpy.unique(). The Pandas version does not sort the result, but you can do that yourself and it will still be much faster if the result is much smaller than the input (i.e. there are a lot of duplicate values):
np.sort(pd.unique(arr))
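Applied to the question's list comprehension, that would look something like this (drop the np.sort if you don't need sorted output):
import numpy as np
import pandas as pd
NNlist = [np.sort(pd.unique(a)) for a in NNlist]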
Timings:
In [1]: x = np.random.randint(10, 20, 50000000)
In [2]: %timeit np.sort(pd.unique(x))
201 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit np.unique(x)
1.49 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
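A small illustration of the difference in output order, on made-up data: np.unique returns sorted values, while pd.unique returns them in order of first appearance:
import numpy as np
import pandas as pd
x = np.array([3, 1, 2, 1, 3])
print(np.unique(x))  # [1 2 3] (sorted)
print(pd.unique(x))  # [3 1 2] (order of first appearance)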
numpy.unique() is based on sorting (quicksort), while pandas.unique() is based on a hash table. Normally the latter is faster, according to my benchmarks. Both are already very optimized.
For some special cases, you can still improve the performance.
For example, if the data are already sorted, you can skip the sorting step:
import numpy as np

# ar is already sorted
# this segment is adapted from the numpy source code
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = ar[1:] != ar[:-1]  # a sorted value is new only where it differs from its predecessor
ret = ar[mask]
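A quick sanity check of the mask trick on made-up sorted data:
import numpy as np
ar = np.sort(np.random.randint(0, 100, 1000))  # sorted input with duplicates
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = ar[1:] != ar[:-1]
assert np.array_equal(ar[mask], np.unique(ar))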
I ran into a similar problem and wrote my own unique function, because pandas.unique doesn't support a return_counts option. It is a fast implementation, but it only supports integer arrays. You can check out the source code here.
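For a rough idea of what an integer-only unique with counts can look like (this is my own sketch built on np.bincount, not the linked source, and it assumes non-negative integers):
import numpy as np

def unique_with_counts(ar):
    # histogram of values: nonzero bins are the unique values,
    # and the bin heights are their counts
    counts = np.bincount(ar)
    values = np.nonzero(counts)[0]
    return values, counts[values]

x = np.random.randint(0, 50, 10_000)
vals, cnts = unique_with_counts(x)
ref_vals, ref_cnts = np.unique(x, return_counts=True)
assert np.array_equal(vals, ref_vals) and np.array_equal(cnts, ref_cnts)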
I also took a look at list(set()), comparing integer and string data stored in a pandas Series versus a plain Python list.
data = np.random.randint(0,10,100)
data_hex = [str(hex(n)) for n in data] # just some simple strings
sample1 = pd.Series(data, name='data')
sample2 = data.tolist()
sample3 = pd.Series(data_hex, name='data')
sample4 = data_hex
And then the benchmarks:
%timeit np.unique(sample1) # 16.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample2) # 15.9 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample3) # 45.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.unique(sample4) # 20.6 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample1) # 60.3 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample2) # 196 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample3) # 79.7 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample4) # 214 µs ± 61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(set(sample1)) # 16.3 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample2)) # 1.64 µs ± 83.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit list(set(sample3)) # 17.8 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample4)) # 2.48 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The takeaway is:
Starting with a Pandas Series with integers? Go with either np.unique() or list(set())
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
However, if N=1,000,000 instead, the results are different.
%timeit np.unique(sample1) # 26.5 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample2) # 98.1 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(sample3) # 1.31 s ± 78.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample4) # 174 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.unique(sample1) # 10.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.unique(sample2) # 99.3 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample3) # 46.4 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample4) # 113 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample1)) # 25.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample2)) # 11.2 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(set(sample3)) # 37.1 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample4)) # 20.2 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now the takeaway is:
Starting with a Pandas Series with integers? Go with pd.unique()
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
Using frozenset instead of set here should be at least a little faster.
Here are some benchmarks:
In [72]: ar_list = [np.random.randint(0, 100, 1000) for _ in range(100)]
In [73]: %timeit map(np.unique, ar_list)
100 loops, best of 3: 4.9 ms per loop
In [74]: %timeit [np.unique(ar) for ar in ar_list]
100 loops, best of 3: 4.9 ms per loop
In [75]: %timeit [pd.unique(ar) for ar in ar_list] # using pandas
100 loops, best of 3: 2.25 ms per loop
So pandas.unique seems to be faster than numpy.unique. However, the docstring mentions that the values are "not necessarily sorted", which (partly) explains why it is faster.
Using a list comprehension or map doesn't give a difference in this example (though note that on Python 3, map returns a lazy iterator, so a fair timing would need list(map(np.unique, ar_list))).
I recently had to count the number of unique subarrays in a 32-bit 2D integer array, so neither pandas.unique nor pandas.value_counts was an option for me.
I've found that the following solution was faster by about a factor of 3 for an array with 10e7 items and a fairly non-uniform distribution:
import collections

# each row becomes a hashable tuple via zip(*arr.T); Counter then tallies duplicates
unique_counts = collections.Counter(zip(*sequences_array.T))
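For illustration, here is the approach on a tiny made-up array:
import collections
import numpy as np

arr = np.array([[1, 2], [3, 4], [1, 2], [1, 2]], dtype=np.int32)
unique_counts = collections.Counter(zip(*arr.T))
# row (1, 2) appears three times, row (3, 4) once
print(unique_counts)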
See also: How to efficiently convert 2d numpy array into 1d numpy array of tuples?
Is NNlist actually a multi-dimensional numpy array? (And why would you then want the result to be a Python list?)
map(np.unique, NNlist) might be a good place to start.