
I am looking to speed up the following piece of code:

NNlist = [np.unique(i) for i in NNlist]

where NNlist is a list of np.arrays with duplicated entries.

Thanks :)

  • Is the original NNlist actually a multi-dimensional numpy array? (And why would you then want the result to be a Python list?) Commented Dec 3, 2012 at 17:45
  • Actually NNlist is a list of np.arrays because it is created using append (the size is not known a priori). However, at this point I found that using a list or an np.array does not matter. Commented Dec 3, 2012 at 17:48
  • map(np.unique, NNlist) might be a good place to start. Commented Dec 3, 2012 at 17:52

6 Answers


numpy.unique is already pretty optimized; you're not likely to get much of a speedup over what you already have unless you know something else about the underlying data. For example, if the data are all small integers you might be able to use numpy.bincount, or if the unique values in each of the arrays are mostly the same, there might be some optimization that could be done over the whole list of arrays.
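For example, a minimal sketch of the bincount idea, assuming the data are small non-negative integers (unique_small_ints is a hypothetical helper name, not something from the question):

import numpy as np

def unique_small_ints(ar):
    # bincount builds a histogram indexed by value, so the indices of
    # the non-zero bins are exactly the sorted unique values
    counts = np.bincount(ar)
    return np.nonzero(counts)[0]

x = np.random.randint(0, 100, 1_000_000)
assert np.array_equal(unique_small_ints(x), np.unique(x))

This only pays off when the maximum value is small relative to the array length, since bincount allocates one bin per possible value.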


1 Comment

The numpy unique took 27.089452981948853 seconds; the fastremap took 0.2866370677947998 seconds. We're running into a big slowdown on large 3D images when having to switch to numpy.unique().

pandas.unique() is much faster than numpy.unique(). The Pandas version does not sort the result, but you can do that yourself and it will still be much faster if the result is much smaller than the input (i.e. there are a lot of duplicate values):

np.sort(pd.unique(arr))

Timings:

In [1]: x = np.random.randint(10, 20, 50000000)

In [2]: %timeit np.sort(pd.unique(x))
201 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit np.unique(x)
1.49 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



numpy.unique() is based on sorting (quicksort), while pandas.unique() is based on a hash table. Normally the latter is faster, according to my benchmarks. Both are already very optimized, but for some special cases you can still improve performance. For example, if the data is already sorted, you can skip the sorting step:

import numpy as np

# ar is already sorted
# this segment is from the source code of numpy's unique
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True                 # always keep the first element
mask[1:] = ar[1:] != ar[:-1]    # keep elements that differ from their predecessor
ret = ar[mask]
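As a quick sanity check, here is a hedged usage sketch that wraps the segment above in a hypothetical helper and compares it against np.unique on sorted input:

import numpy as np

def unique_sorted(ar):
    # fast path for an already-sorted 1-D array: skip np.unique's sort
    mask = np.empty(ar.shape, dtype=np.bool_)
    mask[:1] = True
    mask[1:] = ar[1:] != ar[:-1]
    return ar[mask]

ar = np.sort(np.random.randint(0, 10, 1_000_000))
assert np.array_equal(unique_sorted(ar), np.unique(ar))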

I ran into a similar problem. I wrote my own unique function because pandas.unique doesn't support the return_counts option. It is a fast implementation, but it only supports integer arrays. You can check out the source code here.
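Since that source code is external, here is only a hedged sketch of how an integer-only unique with counts might look, again assuming non-negative integers (unique_with_counts is an illustrative name, not the answer's actual function):

import numpy as np

def unique_with_counts(ar):
    # one bincount pass replaces the sort, mirroring
    # np.unique(ar, return_counts=True) for this restricted input
    counts = np.bincount(ar)
    values = np.nonzero(counts)[0]
    return values, counts[values]

x = np.random.randint(0, 50, 1_000_000)
vals, cnts = unique_with_counts(x)
ref_vals, ref_cnts = np.unique(x, return_counts=True)
assert np.array_equal(vals, ref_vals) and np.array_equal(cnts, ref_cnts)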



I also benchmarked list(set()), comparing integer and string data, both as pandas Series and as Python lists.

import numpy as np
import pandas as pd

data = np.random.randint(0, 10, 100)
data_hex = [hex(n) for n in data]  # just some simple strings

sample1 = pd.Series(data, name='data')
sample2 = data.tolist()

sample3 = pd.Series(data_hex, name='data')
sample4 = data_hex

And then the benchmarks:

%timeit np.unique(sample1) # 16.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample2) # 15.9 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample3) # 45.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.unique(sample4) # 20.6 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%timeit pd.unique(sample1) # 60.3 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample2) # 196 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample3) # 79.7 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample4) # 214 µs ± 61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit list(set(sample1)) # 16.3 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample2)) # 1.64 µs ± 83.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit list(set(sample3)) # 17.8 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample4)) # 2.48 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The takeaway (at N=100) is:

  • Starting with a Pandas Series with integers? Go with either np.unique() or list(set())

  • Starting with a Pandas Series with strings? Go with list(set())

  • Starting with a list of integers? Go with list(set())

  • Starting with a list of strings? Go with list(set())

However, if N=1,000,000 instead, the results are different.

%timeit np.unique(sample1) # 26.5 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample2) # 98.1 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(sample3) # 1.31 s ± 78.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample4) # 174 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit pd.unique(sample1) # 10.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.unique(sample2) # 99.3 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample3) # 46.4 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample4) # 113 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit list(set(sample1)) # 25.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample2)) # 11.2 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(set(sample3)) # 37.1 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample4)) # 20.2 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
At N=1,000,000, the conclusions become:

  • Starting with a Pandas Series with integers? Go with pd.unique()

  • Starting with a Pandas Series with strings? Go with list(set())

  • Starting with a list of integers? Go with list(set())

  • Starting with a list of strings? Go with list(set())

1 Comment

I would use frozenset rather than set here; it should be at least a little faster.

Here are some benchmarks:

In [72]: ar_list = [np.random.randint(0, 100, 1000) for _ in range(100)]

In [73]: %timeit map(np.unique, ar_list)
100 loops, best of 3: 4.9 ms per loop

In [74]: %timeit [np.unique(ar) for ar in ar_list]
100 loops, best of 3: 4.9 ms per loop

In [75]: %timeit [pd.unique(ar) for ar in ar_list] # using pandas
100 loops, best of 3: 2.25 ms per loop

So pandas.unique seems to be faster than numpy.unique. However, the docstring mentions that the values are "not necessarily sorted", which (partly) explains why it is faster. Using a list comprehension or map makes no difference in this example. (Note that under Python 3, map is lazy, so the map timing above would need list(map(...)) to be meaningful.)



I recently had to count the number of unique subarrays in a 32-bit 2D integer array, so neither pandas.unique nor pandas.value_counts was an option for me.

I've found that the following solution was faster by about a factor of 3 for an array with 10e7 items and a fairly non-uniform distribution:

import collections

# zip(*arr.T) turns each row of the 2-D array into a hashable tuple,
# which Counter can then tally
unique_counts = collections.Counter(zip(*sequences_array.T))

See also: How to efficiently convert 2d numpy array into 1d numpy array of tuples?
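For reference, a self-contained sketch of the Counter approach on a toy 2-D array (sequences_array here is an illustrative stand-in for the real data):

import collections
import numpy as np

# each row is one subarray to count
sequences_array = np.array([[1, 2], [3, 4], [1, 2], [1, 2]], dtype=np.int32)

# zip(*arr.T) yields each row as a hashable tuple, so Counter can
# tally how often every subarray occurs; here (1, 2) appears 3 times
unique_counts = collections.Counter(zip(*sequences_array.T))
print(unique_counts)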

