I am trying to find the minimum image separation between two sets of particles. There are about 40 particles in each set, and their three-dimensional position vectors are stored in two arrays of shape (40, 3). I have to compute the Euclidean distance between each particle in one set and each particle in the other, after applying the minimum image criterion. To make it clearer, a one-dimensional equivalent would be, for two lists pos1 and pos2 of coordinates, [func(i-j) for i in pos1 for j in pos2], where func = lambda x: x - np.rint(x/width)*width is a function that applies the minimum image criterion.
In three dimensions, the Euclidean distance would be numpy.sqrt(dx**2 + dy**2 + dz**2), where dx, dy, and dz are the components returned by func for each dimension.
(The function func is only meant to demonstrate how the minimum image criterion is applied; I do not use this exact program structure.)
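For concreteness, a naive nested-loop sketch of the full 3D operation I want to speed up would look something like this (width = 1.0 is only a placeholder value here):

import numpy as np

width = 1.0  # placeholder box length

def func(x):
    # minimum image criterion: wrap each component of the separation
    # vector into [-width/2, width/2)
    return x - np.rint(x / width) * width

pos1 = np.random.rand(40, 3)
pos2 = np.random.rand(40, 3)

# dist[i, j] = minimum-image Euclidean distance between pos1[i] and pos2[j]
dist = np.array([[np.sqrt((func(p - q) ** 2).sum()) for q in pos2]
                 for p in pos1])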
I am looking for an efficient way to do this, as I have to perform the same operations while analysing multiple data sets, each with about 20,000 time steps, and each time step containing 3 sets of 40 particles, i.e. 6 combinations of sets to compute per time step, in each data set.
Googling led me to scipy.spatial.distance.cdist, but I am having trouble optimising the computation time. The built-in routines for distances (Euclidean, Minkowski, Manhattan, Chebyshev, etc.) are optimised and run pretty fast (up to three orders of magnitude faster in my tests below) compared to explicit function definitions given as arguments:
In [1]: import numpy as np
In [2]: from scipy.spatial.distance import cdist, euclidean
In [3]: %%timeit
...: pos1 = np.random.rand(40, 3)
...: pos2 = np.random.rand(40, 3)
...: cdist(pos1, pos2, metric='euclidean')
...:
The slowest run took 12.46 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 39.3 µs per loop
In [4]: %%timeit
...: pos1 = np.random.rand(40, 3)
...: pos2 = np.random.rand(40, 3)
...: cdist(pos1, pos2, metric=euclidean)
...:
10 loops, best of 3: 43 ms per loop
In [5]: %%timeit
...: pos1 = np.random.rand(40, 3)
...: pos2 = np.random.rand(40, 3)
...: cdist(pos1, pos2, lambda u, v: np.sqrt(((u-v)**2).sum()) )
...:
100 loops, best of 3: 15.5 ms per loop
In [6]: width = 1.0
In [7]: func = lambda x: x - np.rint(x/width)*width
In [8]: %%timeit
...: pos1 = np.random.rand(40, 3)
...: pos2 = np.random.rand(40, 3)
...: cdist(pos1, pos2, lambda u, v: np.sqrt(((func(u)-func(v))**2).sum()) )
...:
10 loops, best of 3: 31.2 ms per loop
Here is what I have considered as options:
- Explicitly loop over array elements and build the required array (probably the least efficient)
- Separate the array into its three x, y, and z components, apply the minimum image criterion, and use cdist to calculate the distances for each component individually (because numpy.sqrt(dx**2) == dx and so on), then reconstruct the (40, 3) array from the component arrays and repeat cdist to calculate the distances in 3D (a rough sketch follows after this list)
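Roughly, the second option would look something like this (only the per-component part; this sketch assumes all coordinates already lie within [0, width)):

import numpy as np
from scipy.spatial.distance import cdist

width = 1.0  # placeholder box length
pos1 = np.random.rand(40, 3)
pos2 = np.random.rand(40, 3)

sq = np.zeros((len(pos1), len(pos2)))
for k in range(3):
    # 1D cdist gives the absolute separation |dx| for this component
    d = cdist(pos1[:, k:k + 1], pos2[:, k:k + 1])
    # apply the minimum image criterion to the separations
    d = d - np.rint(d / width) * width
    sq += d ** 2

dist = np.sqrt(sq)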
What would be an efficient way to calculate the equivalent of cdist(pos1, pos2, lambda u, v: np.sqrt(((func(u)-func(v))**2).sum()) )?
Question:
Is there any built-in NumPy function that can give an equivalent of [(i-j) for i in pos1 for j in pos2], but along a specified axis of two arrays?
An example representation for what I intend to achieve:
    [ a 0 0 ]       [ x 0 0 ]
A = [ b 0 0 ] ; B = [ y 0 0 ]
    [ c 0 0 ]       [ z 0 0 ]

         [ a-x 0 0 ]
         [ a-y 0 0 ]
         [ a-z 0 0 ]
         [ b-x 0 0 ]
Result = [ b-y 0 0 ]
         [ b-z 0 0 ]
         [ c-x 0 0 ]
         [ c-y 0 0 ]
         [ c-z 0 0 ]
(All values are float, and the operation is to be done for all columns.)
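In loop form (with placeholder numbers standing in for a, b, c and x, y, z), the result above corresponds to:

import numpy as np

# placeholder values, just to illustrate the shape of the result
A = np.array([[1.0, 0.0, 0.0],
              [2.0, 0.0, 0.0],
              [3.0, 0.0, 0.0]])
B = np.array([[10.0, 0.0, 0.0],
              [20.0, 0.0, 0.0],
              [30.0, 0.0, 0.0]])

# row-wise differences for every pair (row of A, row of B), stacked into (9, 3)
result = np.array([i - j for i in A for j in B])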