I am looking to speed up the following piece of code:
NNlist=[np.unique(i) for i in NNlist]
where NNlist is a list of np.arrays with duplicated entries.
Thanks :)
numpy.unique is already pretty optimized; you're not likely to get much of a speedup over what you already have unless you know something else about the underlying data. For example, if the data are all small integers you might be able to use numpy.bincount, or if the unique values in each of the arrays are mostly the same, there might be some optimization that could be done over the whole list of arrays.
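For instance, a minimal sketch of the numpy.bincount idea, assuming the values are small non-negative integers (my own illustration, not from the question):
import numpy as np
x = np.random.randint(10, 20, 1_000_000)
# nonzero bins of the value histogram are exactly the sorted unique values
uniques = np.nonzero(np.bincount(x))[0]
assert np.array_equal(uniques, np.unique(x))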
pandas.unique() is much faster than numpy.unique(). The Pandas version does not sort the result, but you can do that yourself and it will still be much faster if the result is much smaller than the input (i.e. there are a lot of duplicate values):
np.sort(pd.unique(arr))
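Applied to the question's list comprehension, that would look something like this (drop the np.sort if you don't need sorted output):
import numpy as np
import pandas as pd
NNlist = [np.sort(pd.unique(a)) for a in NNlist]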
Timings:
In [1]: x = np.random.randint(10, 20, 50000000)
In [2]: %timeit np.sort(pd.unique(x))
201 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit np.unique(x)
1.49 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
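A small illustration of the difference in output order, on made-up data: np.unique returns sorted values, while pd.unique returns them in order of first appearance:
import numpy as np
import pandas as pd
x = np.array([3, 1, 2, 1, 3])
print(np.unique(x))  # [1 2 3] (sorted)
print(pd.unique(x))  # [3 1 2] (order of first appearance)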
numpy.unique() is based on sorting (quicksort), while pandas.unique() is based on a hash table. Normally the latter is faster, according to my benchmarks. Both are already very optimized.
For some special cases, you can still improve the performance.
For example, if the data are already sorted, you can skip the sorting step:
import numpy as np

# ar is already sorted
# this segment is adapted from the numpy source code
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = ar[1:] != ar[:-1]  # a sorted value is new only where it differs from its predecessor
ret = ar[mask]
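A quick sanity check of the mask trick on made-up sorted data:
import numpy as np
ar = np.sort(np.random.randint(0, 100, 1000))  # sorted input with duplicates
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = ar[1:] != ar[:-1]
assert np.array_equal(ar[mask], np.unique(ar))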
I ran into a similar problem and wrote my own unique function, because pandas.unique doesn't support a return_counts option. It is a fast implementation, but it only supports integer arrays. You can check out the source code here.
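For a rough idea of what an integer-only unique with counts can look like (this is my own sketch built on np.bincount, not the linked source, and it assumes non-negative integers):
import numpy as np

def unique_with_counts(ar):
    # histogram of values: nonzero bins are the unique values,
    # and the bin heights are their counts
    counts = np.bincount(ar)
    values = np.nonzero(counts)[0]
    return values, counts[values]

x = np.random.randint(0, 50, 10_000)
vals, cnts = unique_with_counts(x)
ref_vals, ref_cnts = np.unique(x, return_counts=True)
assert np.array_equal(vals, ref_vals) and np.array_equal(cnts, ref_cnts)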
I also took a look at list(set()), comparing integer and string data stored in a pandas Series versus a plain Python list.
data = np.random.randint(0,10,100)
data_hex = [str(hex(n)) for n in data] # just some simple strings
sample1 = pd.Series(data, name='data')
sample2 = data.tolist()
sample3 = pd.Series(data_hex, name='data')
sample4 = data_hex
And then the benchmarks:
%timeit np.unique(sample1) # 16.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample2) # 15.9 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample3) # 45.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.unique(sample4) # 20.6 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample1) # 60.3 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample2) # 196 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample3) # 79.7 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample4) # 214 µs ± 61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(set(sample1)) # 16.3 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample2)) # 1.64 µs ± 83.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit list(set(sample3)) # 17.8 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample4)) # 2.48 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The takeaway is:
Starting with a Pandas Series with integers? Go with either np.unique() or list(set())
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
However, if N=1,000,000 instead, the results are different.
%timeit np.unique(sample1) # 26.5 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample2) # 98.1 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(sample3) # 1.31 s ± 78.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample4) # 174 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.unique(sample1) # 10.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.unique(sample2) # 99.3 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample3) # 46.4 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample4) # 113 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample1)) # 25.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample2)) # 11.2 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(set(sample3)) # 37.1 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample4)) # 20.2 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Now the takeaway is:
Starting with a Pandas Series with integers? Go with pd.unique()
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
Using frozenset instead of set here should be at least a little faster.
Here are some benchmarks:
In [72]: ar_list = [np.random.randint(0, 100, 1000) for _ in range(100)]
In [73]: %timeit map(np.unique, ar_list)
100 loops, best of 3: 4.9 ms per loop
In [74]: %timeit [np.unique(ar) for ar in ar_list]
100 loops, best of 3: 4.9 ms per loop
In [75]: %timeit [pd.unique(ar) for ar in ar_list] # using pandas
100 loops, best of 3: 2.25 ms per loop
So pandas.unique seems to be faster than numpy.unique. However, the docstring mentions that the values are "not necessarily sorted", which (partly) explains why it is faster.
Using a list comprehension or map doesn't give a difference in this example (though note that on Python 3, map returns a lazy iterator, so a fair timing would need list(map(np.unique, ar_list))).
I recently had to count the number of unique subarrays in a 32-bit 2D integer array, so neither pandas.unique nor pandas.value_counts was an option for me.
I've found that the following solution was faster by about a factor of 3 for an array with 10e7 items and a fairly non-uniform distribution:
import collections

# each row becomes a hashable tuple via zip(*arr.T); Counter then tallies duplicates
unique_counts = collections.Counter(zip(*sequences_array.T))
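For illustration, here is the approach on a tiny made-up array:
import collections
import numpy as np

arr = np.array([[1, 2], [3, 4], [1, 2], [1, 2]], dtype=np.int32)
unique_counts = collections.Counter(zip(*arr.T))
# row (1, 2) appears three times, row (3, 4) once
print(unique_counts)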
See also: How to efficiently convert 2d numpy array into 1d numpy array of tuples?
Is NNlist actually a multi-dimensional numpy array? (And why would you then want the result to be a Python list?)
map(np.unique, NNlist) might be a good place to start.