I have a list of descriptors (arrays) and need to compute the cosine distance between them, which is a lot of computation: for a list of n vectors it takes n * (n - 1) / 2 distance evaluations. I need to do this for a large n, so how can I speed up the process using multiprocessing? I found some answers, but it is not clear to me how to apply them to my case.

The code that I want to speed up is the code below:

from scipy.spatial.distance import cosine as cosine_distance

n = len(all_vectors)
all_distances = []
for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]
        all_distances.append(cosine_distance(x1, x2))

[UPDATE]

I also need to do a small label check, so the original code is actually the following:

from scipy.spatial.distance import cosine as cosine_distance

n = len(all_vectors)
all_distances = []
all_label_checks = []
for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]
        all_distances.append(cosine_distance(x1, x2))
        
        label_x1 = all_labels[i]  # all_labels and all_distances are tied
        label_x2 = all_labels[j]
        all_label_checks.append(int(label_x1 == label_x2))

I timed some of the suggestions, and the best answer so far is from @DaniMesejo. Here is the code that I used to test each case:

import time
import numpy as np

from scipy.spatial.distance import pdist, cosine as cosine_distance
from itertools import combinations, starmap


EMBEDDINGS_SIZE = 128
N_EMBEDDINGS = 1024


def get_n_embeddings(n):
    return np.random.randn(n, EMBEDDINGS_SIZE)


embeddings = get_n_embeddings(N_EMBEDDINGS)

###################
# scipy pdist
###################
start = time.time()
pdist_distances = pdist(embeddings, 'cosine')
end = time.time()
print(f"Using scipy pdsit: {end - start}")

###################
# nested loop
###################
nested_distances = []
start = time.time()
for i in range(N_EMBEDDINGS):
    for j in range(i + 1, N_EMBEDDINGS):
        x1 = embeddings[i]
        x2 = embeddings[j]
        nested_distances.append(cosine_distance(x1, x2))
end = time.time()
print(f"Using nested loop: {end - start}")

###################
# combinations
###################
start = time.time()
combination_distances = []
for x1, x2 in combinations(embeddings, 2):
    combination_distances.append(cosine_distance(x1, x2))
end = time.time()
print(f"Using combination: {end - start}")

###################
# starmap
###################
start = time.time()
starmap_distances = list(starmap(cosine_distance, combinations(embeddings, 2)))
end = time.time()
print(f"Using starmap: {end - start}")

print(np.allclose(pdist_distances, nested_distances))
# pdist_class_distances is computed in the "pdist and class" block shown below
print(np.allclose(pdist_distances, pdist_class_distances))
print(np.allclose(pdist_distances, combination_distances))
print(np.allclose(pdist_distances, starmap_distances))

And the results:

Using scipy pdist: 0.0303647518157959
Using pdist and class: 0.09841656684875488
Using nested loop: 13.415924549102783
Using combination: 13.093504428863525
Using starmap: 13.177483081817627
True
True
True
True

To solve my label problem I can also use pdist. I just need to transform my label list (which tells the label of each embedding) into a list of 2-d points and compute the pairwise distance on it: pairs with the same label will have distance 0:

###################
# pdist and class
###################
# assign the same class id to every group of 4 consecutive embeddings
classes = [i // 4 for i in range(N_EMBEDDINGS)]
# pdist expects 2-d input, so turn each label into a trivial 2-d point
labels_x = [(i, i) for i in classes]
start = time.time()
pdist_class_distances = pdist(embeddings, 'cosine')
pdist_labels = pdist(labels_x, 'euclidean')
# distance 0 means the pair shares a label, mirroring int(label_x1 == label_x2)
pdist_labels = [1 if x == 0.0 else 0 for x in pdist_labels]
end = time.time()
print(f"Using pdist and class: {end - start}")
  • Not saying multiprocessing is necessarily the ideal solution, but typically the way you'd use multiprocessing is to have each process work through only a chunk of all_vectors, and then concatenate the results into one list at the end (see the sketch after these comments). Does that make sense? Commented Nov 20, 2020 at 20:20
  • An initial thing to look at would be to put the for j in range(i + 1, n) loop into a thread and collect the results as they complete. Take a look at something like stackoverflow.com/questions/6893968/… to do an initial janky implementation and dip your toes into threading. Start by turning the body of the j loop into a function that accepts i and n and returns x1/x2, or one that calls cosine_distance and returns the result, since that's likely where the heavy lifting happens Commented Nov 20, 2020 at 20:22
  • Did you check cdist with the cosine distance metric? It's capable of computing the distance between each pair of two collections of inputs Commented Nov 20, 2020 at 20:25
  • @AsciiFace: Threading doesn't get you anything for CPU bound work in CPython unless the work is being done in a C extension that explicitly releases the GIL while doing a lot of work. You'd have to go with multiprocessing or concurrent.futures.ProcessPoolExecutor to do the work in separate processes (which aren't GIL bound, but aren't always faster due to IPC overhead if the work done is too small). Commented Nov 20, 2020 at 20:27
  • 1
    @AsciiFace: So you're technically wrong, but right in all ways that matter. :-) Commented Nov 20, 2020 at 20:49
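
A minimal sketch of the chunking idea from the first comment, assuming one task per row i (its pairings with all later rows) and a module-global copy of the vectors per worker; the helper names are hypothetical, the script needs the usual if __name__ == "__main__" guard on spawn-based platforms, and, as the later comments warn, IPC overhead may still make this slower than pdist:

from multiprocessing import Pool
from scipy.spatial.distance import cosine as cosine_distance

_vectors = None  # per-worker copy, set once by the initializer

def _init_worker(vectors):
    global _vectors
    _vectors = vectors

def _row_distances(i):
    # one task = row i paired with every later row j
    return [cosine_distance(_vectors[i], _vectors[j]) for j in range(i + 1, len(_vectors))]

def all_pairs_parallel(all_vectors, processes=4):
    with Pool(processes, initializer=_init_worker, initargs=(all_vectors,)) as pool:
        per_row = pool.map(_row_distances, range(len(all_vectors)))
    # flatten the per-row chunks back into one list, preserving (i, j) order
    return [d for row in per_row for d in row]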

2 Answers


A possible approach is to use the scipy function pdist, for example:

from scipy.spatial.distance import pdist, cosine as cosine_distance, squareform
import numpy as np


def all_pairs_using_pdist(X):
    """Function using pdist"""
    distances = squareform(pdist(X, 'cosine'))
    rows, _ = distances.shape
    return list(distances[np.triu_indices(rows, 1)])


def all_pairs_for_loops(all_vectors):
    """Function using a nested for loop"""
    n = len(all_vectors)
    all_distances = []
    for i in range(n):
        for j in range(i + 1, n):
            x1 = all_vectors[i]
            x2 = all_vectors[j]
            all_distances.append(cosine_distance(x1, x2))
    return all_distances

For a matrix like the following:

Y = np.random.randn(100, 25)

It gives the timings:

%timeit all_pairs_using_pdist(Y)
630 µs ± 28.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit all_pairs_for_loops(Y)
216 ms ± 78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

About 300 times faster. And of course, both approaches yield the same result:

Y = np.random.randn(100, 25)
print(np.allclose(all_pairs_for_loops(Y), all_pairs_using_pdist(Y)))

Output

True
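
A possible simplification: pdist already returns the n * (n - 1) / 2 distances as a condensed vector in the same (i, j), i < j order, so the squareform round trip can be skipped entirely. A minimal sketch:

def all_pairs_using_pdist_condensed(X):
    # pdist's condensed output is already the upper triangle, row by row
    return list(pdist(X, 'cosine'))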

2 Comments

Your answer seems pretty good, thank you very much. I only have one extra doubt: how can I adapt your solution to do a little verification? My original problem contains all_vectors and all_labels. For each comparison between two vectors x1 and x2, I also check whether they share the same label. I append the cosine distance to one list, and 1 or 0 to another list depending on whether they are from the same class.
For example:

for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]
        all_distances.append(cosine_distance(x1, x2))
        label_x1 = all_labels[i]
        label_x2 = all_labels[j]
        all_labels_check.append(int(label_x1 == label_x2))

How much this will help I can't say, but your nested loop is basically an attempt to generate all unique 2-combinations of legal indices in order to get all unique 2-combinations of the values. There are built-ins for this, so with a from itertools import combinations, this:

for i in range(n):
    for j in range(i + 1, n):
        x1 = all_vectors[i]
        x2 = all_vectors[j]

could simplify to this:

for x1, x2 in combinations(all_vectors, 2):

You could even layer all this over itertools.starmap to push the entire loop to the C layer (assuming cosine_distance is itself implemented in C; if not, it just limits the upside, but still works fine), simplifying all of your posted code to:

from scipy.spatial.distance import cosine as cosine_distance
from itertools import combinations, starmap

all_distances = list(starmap(cosine_distance, combinations(all_vectors, 2)))

and running faster to boot (meaningfully faster if cosine_distance is a relatively low cost operation implemented in C).

If that's not fast enough, odds are you'll have to use multiprocessing/concurrent.futures.ProcessPoolExecutor to parallelize (maybe multiprocessing.dummy/concurrent.futures.ThreadPoolExecutor if the scipy function is implemented in C and releases the GIL while running, allowing true thread parallelism, which is normally not available to CPU bound threads on CPython due to the GIL).
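
For illustration, a sketch of what that could look like with concurrent.futures; the function names here are hypothetical, and the per-pair pickling overhead may well eat any gains when the distance computation itself is cheap:

from concurrent.futures import ProcessPoolExecutor
from itertools import combinations
from scipy.spatial.distance import cosine as cosine_distance

def _pair_distance(pair):
    # unpack one (x1, x2) pair and compute its cosine distance
    return cosine_distance(*pair)

def all_pairs_process_pool(all_vectors, workers=4):
    with ProcessPoolExecutor(max_workers=workers) as executor:
        # a large chunksize batches many pairs per IPC round trip; without it,
        # per-pair dispatch overhead swamps the cheap distance computation
        return list(executor.map(_pair_distance, combinations(all_vectors, 2), chunksize=1024))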

3 Comments

I already tested with itertools.combinations and it doesn't speed things up. I tried your starmap suggestion, but it doesn't speed things up either. I tested without calculating the cosine distance; it seems the bottleneck is generating all the possible combinations, built-in functions or not.
@MatheusSantos: How large is all_vectors? If it's large enough, it's going to take time; not much to improve (aside from maybe parallelism).
As big as possible. The solution from @DaniMesejo works; I only need to figure out how to include the label check that I mentioned in the update
