I am using a machine with 2 Xeon CPUs having 16 cores each. There are 2 NUMA domains, one for each CPU.
I have an intensive computation that also uses a lot of memory, and everything is multithreaded. The global structure of the code is:
!$OMP PARALLEL DO
do i = 1, N
!$OMP CRITICAL
! allocate memory and populate it
!$OMP END CRITICAL
end do
!$OMP END PARALLEL DO
...
...
!$OMP PARALLEL DO
do i = 1, N
! main computations part 1
end do
!$OMP END PARALLEL DO
...
...
!$OMP PARALLEL DO
do i = 1, N
! main computations part 2
end do
!$OMP END PARALLEL DO
...
...
N is typically ~10000, and each iteration requires a few seconds.
About 50% of the data read/written in the computations at iteration #i is in the memory previously allocated at iteration #i, and the remaining 50% is in memory allocated in other iterations (but which tend to be close to #i).
Using the same static scheduling for all the loops ensures that at a given iteration #i, 50% of the memory accessed during the computations has been allocated by the same thread that processes the iteration, and hence (thanks to the first-touch policy, which places pages in the NUMA domain of the thread that first writes them) that it is in the same NUMA domain.
Moreover, binding the threads with OMP_PROC_BIND and OMP_PLACES (threads 0-15 on CPU #0 and threads 16-31 on CPU #1) makes it likely that adjacent iterations have their allocated memory in the same NUMA domain.
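Concretely, the binding is set up along these lines (indicative only; it assumes the OS enumerates the cores of CPU #0 before those of CPU #1):

export OMP_NUM_THREADS=32
export OMP_PLACES=cores
export OMP_PROC_BIND=close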
So far so good...
The only issue is that the computational workload is not well balanced between the iterations. It's not too bad, but there can be up to +/-20%... Usually, some dynamic scheduling at the computation stages would help, but here it would defeat the whole strategy of having the same thread allocate and then compute iteration #i.
At the very least, I would like iterations 1...N/2 to be processed by threads 0-15 and iterations N/2+1...N to be processed by threads 16-31: a first level of static chunking (2 chunks of size N/2), and a second level of dynamic scheduling inside each chunk. This would at least ensure that each thread accesses memory mostly in its own NUMA domain.
But I can't see how to do that at all with OpenMP... Is it possible?
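To make the goal concrete, here is an untested sketch of the behavior I am after, written with nested parallelism (it assumes OMP_MAX_ACTIVE_LEVELS=2 is set, that N is even, and I am not sure how well nesting plays with the thread binding):

! needs: use omp_lib
!$OMP PARALLEL NUM_THREADS(2) PROC_BIND(SPREAD) PRIVATE(half)
half = omp_get_thread_num()   ! 0 or 1: one outer thread per NUMA domain
! dynamic scheduling inside each half, confined to one NUMA domain
!$OMP PARALLEL DO NUM_THREADS(16) PROC_BIND(CLOSE) SCHEDULE(DYNAMIC)
do i = half*(N/2)+1, (half+1)*(N/2)
   ! main computations
end do
!$OMP END PARALLEL DO
!$OMP END PARALLEL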
EDIT: schedule(nonmonotonic:dynamic) could have been a solution here, but on the HPC cluster I am using, I am stuck with compiler versions (Intel compiler 2021 at best) that do not implement this scheduling.
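For reference, the nonmonotonic modifier allows the runtime to hand out chunks out of order, which is what permits a work-stealing implementation; the directive would simply read:

!$OMP PARALLEL DO SCHEDULE(NONMONOTONIC:DYNAMIC)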
ANSWER: You can use schedule(runtime) with the parallel do directive. When running the application, set OMP_SCHEDULE=static_steal as an environment variable before you start the application. The loop is then partitioned statically at first, but when threads run out of work, they can steal from other threads. Does that solve your problem?

COMMENT: You can also expect schedule(nonmonotonic:dynamic) to be implemented using static stealing. This is actually allowed to be the default implementation of schedule(dynamic), but it doesn't have to be, so being explicit about it is necessary if you want to be sure!

COMMENT (OP): The compiler accepts schedule(nonmonotonic:dynamic), but the behavior is exactly the same as schedule(monotonic:dynamic) or schedule(dynamic). A few months ago I tested the version 2024, and the implementation is rather the "static stealing" one.
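A minimal sketch of the approach suggested in the answer (note that static_steal is a value specific to the Intel OpenMP runtime, not part of the OpenMP standard):

!$OMP PARALLEL DO SCHEDULE(RUNTIME)
do i = 1, N
   ! main computations part 1
end do
!$OMP END PARALLEL DO

and before launching the application:

export OMP_SCHEDULE=static_steal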