After reading the mpi4py documentation, I can't find any information about using MPI with multithreaded NumPy applications. That is, most of the mpi4py documentation related to NumPy concerns data-movement operations (e.g., Bcast, Scatter, Gather), but I'm interested in parallelizing heavier computational procedures, for instance linear system solves with np.linalg.solve, which I understand can use multiple cores/threads when NumPy is linked against a multithreaded BLAS/LAPACK implementation (e.g., OpenBLAS or MKL). What's missing from the documentation is guidance on how many MPI processes to use when each process may itself use multiple threads/cores, as in the example below. Similarly, if there were a way to determine the exact number of threads/cores used by a given NumPy procedure, then presumably some basic arithmetic could determine how many MPI processes to use.
How can I reason about the number of MPI processes that can/should be used given the goal is to parallelize a NumPy program which itself is using multiple threads?
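The closest I have found for inspecting this is the third-party threadpoolctl package, which (if I understand it correctly) can report and cap the thread pool used by NumPy's BLAS/LAPACK backend. A minimal sketch of what I mean, assuming threadpoolctl is installed:

from threadpoolctl import threadpool_info, threadpool_limits
import numpy as np

# Report the thread pools NumPy's BLAS/LAPACK backend exposes
# (e.g., OpenBLAS or MKL) and how many threads each is using.
for pool in threadpool_info():
    print(pool.get("internal_api"), pool.get("num_threads"))

# Temporarily cap the BLAS thread pool, e.g., to one thread.
with threadpool_limits(limits=1, user_api="blas"):
    _ = np.linalg.solve(np.eye(3), np.ones(3))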
On my laptop (which supports 12 hardware threads), the following basic example has a runtime that scales essentially linearly with the number of MPI processes launched, leading me to believe the work is being serialized behind the scenes, most likely because the BLAS/LAPACK routine already uses multiple cores/threads.
import time
from mpi4py import MPI
import numpy as np
import numpy.random as npr
world = MPI.COMM_WORLD
world_size: int = world.Get_size()
rank: int = world.Get_rank()
master: bool = (rank == 0)
n: int = 8_000 # corresponds to ~2.5s runtime on my laptop
A: np.ndarray = npr.randint(0, 10, (n, n))
b: np.ndarray = npr.randint(0, 10, (n,))
start: float = time.perf_counter()
_ = np.linalg.solve(A, b)
runtime: float = time.perf_counter() - start
if master:
    print(f"{runtime=:.3f}s")
I would expect launching this with mpiexec -n 2 to have the same runtime as mpiexec -n 1, but this is not the case: it takes approximately twice as long. Note that if I replace np.linalg.solve with a function like time.sleep, using twice as many processes does not increase the runtime.
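For what it's worth, my current guess is that the processes are oversubscribing the same cores. A sketch of how I would try to give each process an equal share of threads is below; the environment variable names are assumptions about which BLAS NumPy is linked against (OpenMP/OpenBLAS/MKL), and as far as I understand they must be set before NumPy is first imported for them to take effect:

import os
from mpi4py import MPI

# Give each MPI process an equal share of this machine's logical cores.
# NOTE: which variable actually matters depends on the BLAS build.
n_procs = MPI.COMM_WORLD.Get_size()
threads_per_proc = max(1, (os.cpu_count() or 1) // n_procs)
for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = str(threads_per_proc)

import numpy as np  # imported only after the thread caps are in place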
EDIT: To clarify what seems to be a misunderstanding here: this example is meant to stand in for an application in which each process completes its own distinct, independent linear system solve. That is, the data (A, b) would be different on each process.
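Here is a minimal sketch of that shape, where each rank seeds its own generator so the systems really are distinct, and the root only gathers the per-rank timings for reporting:

import time
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank builds and solves its *own* independent system.
rng = np.random.default_rng(seed=rank)
n = 8_000
A = rng.random((n, n))
b = rng.random(n)

start = time.perf_counter()
_ = np.linalg.solve(A, b)
runtime = time.perf_counter() - start

# Collect per-rank runtimes on the root just to report them.
runtimes = comm.gather(runtime, root=0)
if rank == 0:
    print([f"{t:.3f}s" for t in runtimes])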