After reading the mpi4py documentation, I can't find any information about using MPI with multithreaded NumPy applications. That is, most of the mpi4py documentation related to NumPy concerns data-movement operations (e.g., Bcast, Scatter, Gather), but I'm interested in parallelizing heavier computational procedures, for instance linear system solves with np.linalg.solve, which I understand can use multiple cores/threads when NumPy is linked against a multithreaded BLAS/LAPACK implementation (e.g., OpenBLAS or MKL). What's missing from the documentation is guidance on how many MPI processes to use when each process may itself use multiple threads/cores, as in the example below. Similarly, if there were a way to determine the exact number of threads/cores used by a given NumPy procedure, then presumably some basic arithmetic could determine how many MPI processes to use.
How can I reason about the number of MPI processes that can/should be used given the goal is to parallelize a NumPy program which itself is using multiple threads?
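The closest I have found for inspecting this is the third-party threadpoolctl package, which (if I understand it correctly) can report and cap the thread pool used by NumPy's BLAS/LAPACK backend. A minimal sketch of what I mean, assuming threadpoolctl is installed:

from threadpoolctl import threadpool_info, threadpool_limits
import numpy as np

# Report the thread pools NumPy's BLAS/LAPACK backend exposes
# (e.g., OpenBLAS or MKL) and how many threads each is using.
for pool in threadpool_info():
    print(pool.get("internal_api"), pool.get("num_threads"))

# Temporarily cap the BLAS thread pool, e.g., to one thread.
with threadpool_limits(limits=1, user_api="blas"):
    _ = np.linalg.solve(np.eye(3), np.ones(3))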
On my laptop (which supports 12 hardware threads), the following basic example has a runtime that scales essentially linearly with the number of MPI processes launched, leading me to believe the work is being serialized behind the scenes, most likely because the BLAS/LAPACK routine already uses multiple cores/threads.
import time
from mpi4py import MPI
import numpy as np
import numpy.random as npr
world = MPI.COMM_WORLD
world_size: int = world.Get_size()
rank: int = world.Get_rank()
master: bool = (rank == 0)
n: int = 8_000 # corresponds to ~2.5s runtime on my laptop
A: np.ndarray = npr.randint(0, 10, (n, n))
b: np.ndarray = npr.randint(0, 10, (n,))
start: float = time.perf_counter()
_ = np.linalg.solve(A, b)
runtime: float = time.perf_counter() - start
if master:
    print(f"{runtime=:.3f}s")
I would expect launching this with mpiexec -n 2 to have the same runtime as mpiexec -n 1, but this is not the case: it takes approximately twice as long. Note that if I replace np.linalg.solve with a function like time.sleep, using twice as many processes does not increase the runtime.
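For what it's worth, my current guess is that the processes are oversubscribing the same cores. A sketch of how I would try to give each process an equal share of threads is below; the environment variable names are assumptions about which BLAS NumPy is linked against (OpenMP/OpenBLAS/MKL), and as far as I understand they must be set before NumPy is first imported for them to take effect:

import os
from mpi4py import MPI

# Give each MPI process an equal share of this machine's logical cores.
# NOTE: which variable actually matters depends on the BLAS build.
n_procs = MPI.COMM_WORLD.Get_size()
threads_per_proc = max(1, (os.cpu_count() or 1) // n_procs)
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = str(threads_per_proc)

import numpy as np  # imported only after the thread caps are in place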
EDIT: To clarify what seems to be a misunderstanding here: this example is meant to stand in for an application in which each process completes its own distinct, independent linear system solve. That is, the data (A, b) would be different on each process.
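Here is a minimal sketch of that shape, where each rank seeds its own generator so the systems really are distinct, and the root only gathers the per-rank timings for reporting:

import time
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank builds and solves its *own* independent system.
rng = np.random.default_rng(seed=rank)
n = 8_000
A = rng.random((n, n))
b = rng.random(n)

start = time.perf_counter()
_ = np.linalg.solve(A, b)
runtime = time.perf_counter() - start

# Collect per-rank runtimes on the root just to report them.
runtimes = comm.gather(runtime, root=0)
if rank == 0:
    print([f"{t:.3f}s" for t in runtimes])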