
I am using a machine with 2 Xeon CPUs having 16 cores each. There are 2 NUMA domains, one for each CPU.

I have intensive computations that also use a lot of memory, and everything is multithreaded. The global structure of the code is:

!$OMP PARALLEL DO 
do i = 1, N
   !$OMP CRITICAL
   ! allocate memory and populate it
   !$OMP END CRITICAL
end do
!$OMP END PARALLEL DO 
...
...
!$OMP PARALLEL DO 
do i = 1, N
   ! main computations part 1
end do
!$OMP END PARALLEL DO 
...
...
!$OMP PARALLEL DO 
do i = 1, N
   ! main computations part 2
end do
!$OMP END PARALLEL DO 
...
...

N is typically ~10000, and each iteration requires a few seconds.

About 50% of the data read/written in the computations at iteration #i are in the memory previously allocated at iteration #i, and the remaining 50% are in the memory allocated at other iterations (which, however, tend to be close to #i).

Using the same static scheduling for all the loops ensures that, at a given iteration #i, 50% of the memory accessed during the computations has been allocated by the same thread as the one processing the iteration, and hence lies in the same NUMA domain.

Moreover, binding the threads with OMP_PROC_BIND and OMP_PLACES (threads 0-15 on CPU #0 and threads 16-31 on CPU #1) makes it likely that adjacent iterations have their allocated memory in the same NUMA domain.
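
For completeness, the binding described above can be obtained with settings along these lines (a sketch; it assumes the runtime enumerates the core places socket by socket, which is the usual layout on such a machine):

export OMP_NUM_THREADS=32
export OMP_PLACES=cores
export OMP_PROC_BIND=close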

So far so good...

The only issue is that the computational workload is not well balanced between the iterations. It's not too bad, but there can be up to +/-20%... Usually, using some dynamic scheduling for the computation stages would help, but here it would defeat the whole strategy of having the same thread allocate and then compute iteration #i.

At the very least, I would like iterations 1...N/2 to be processed by threads 0-15 and iterations N/2+1...N to be processed by threads 16-31. So, a first level of static chunking (2 chunks of size N/2), and a second level of dynamic scheduling inside each chunk. This would at least ensure that each thread accesses memory mostly in the same NUMA domain.

But I can't see how to do that at all with OpenMP... Is it possible?

EDIT: schedule(nonmonotonic:dynamic) could have been a solution here, but on the HPC cluster I am using, I am stuck with compiler versions (Intel compiler 2021 at best) that do not implement this scheduling.
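
For reference, this is the clause I would have wanted to put on the computation loops (my Intel classic 2021 compiler accepts the syntax, but then behaves exactly like plain schedule(dynamic)):

!$OMP PARALLEL DO SCHEDULE(nonmonotonic:dynamic)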

  • The compiler should support "static stealing". To enable it, you need to use schedule(runtime) with the parallel do directive. When running the application, set OMP_SCHEDULE=static_steal as an environment variable before you start the application. The loop is then partitioned statically at first, but when threads run out of work, they can steal from other threads. Does that solve your problem? Commented Nov 13, 2024 at 7:24
  • An iteration time on the order of a second sounds reasonable for using tasking. As long as you don't use the gomp runtime, locally created tasks should execute locally on the creating thread until its queue runs empty and threads start to steal from other threads. So, the idea would be to create the tasks on the thread that should have the data. Commented Nov 13, 2024 at 8:11
  • Following up on @MichaelKlemm's note: in most runtime implementations, I would expect schedule(nonmonotonic:dynamic) to be implemented using static stealing. This is actually allowed to be the default implementation of schedule(dynamic), but it doesn't have to be, so being explicit about it is necessary if you want to be sure! Commented Nov 13, 2024 at 9:31
  • @MichaelKlemm I have tested with the Intel compilers 21 and 18 (the two I have to use to generate the applications), and it seems that they both support static stealing. You can post this as an answer and I will accept it. Commented Nov 13, 2024 at 11:08
  • @JimCownie The Intel compiler (classic) 2021 does accept schedule(nonmonotonic:dynamic), but the behavior is exactly the same as schedule(monotonic:dynamic) or schedule(dynamic). A few months ago I tested version 2024, and there the implementation is rather the "static stealing" one. Commented Nov 13, 2024 at 11:22

2 Answers


Considering an execution time on the order of a second per iteration, tasks will not add significant runtime overhead. With the LLVM/Intel OpenMP runtime, tasks are queued thread-locally and threads start stealing once they are done with their own iterations:

!$OMP PARALLEL DO SCHEDULE(static)
do i = 1, N
   !$OMP CRITICAL
   ! allocate memory and populate it
   !$OMP END CRITICAL
end do
!$OMP END PARALLEL DO 
...
...
!$OMP PARALLEL DO SCHEDULE(static)
do i = 1, N
   !$OMP TASK
   ! main computations part 1
   !$OMP END TASK
end do
!$OMP END PARALLEL DO 
...
...
!$OMP PARALLEL DO SCHEDULE(static)
do i = 1, N
   !$OMP TASK
   ! main computations part 2
   !$OMP END TASK
end do
!$OMP END PARALLEL DO 
...
...

The current GNU libgomp implementation has a single task queue, so the above code would suffer from severe contention during task creation and would also lose any data locality.

To ensure binding of iterations to sockets, an alternative is to use nested parallel regions together with the runtime schedule suggested by @MichaelKlemm:

export KMP_HOT_TEAMS_MAX_LEVEL=2
export OMP_PLACES=sockets,cores
export OMP_PROC_BIND=spread,close
export OMP_NUM_THREADS=2,16
export OMP_SCHEDULE=static_steal
!$OMP PARALLEL private(nOuter)
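! Outer team: 2 threads, one per socket (OMP_NUM_THREADS=2,16 with OMP_PROC_BIND=spread,close).
! nOuter is the outer team size, i.e. the number of per-socket chunks of the iteration range.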
nOuter = omp_get_num_threads()
!$OMP DO SCHEDULE(static)
do s = 1,nOuter
!$OMP PARALLEL DO SCHEDULE(static)
do i = 1 + ((s-1)*N)/nOuter, (s*N)/nOuter   ! contiguous, non-overlapping split of 1..N
   !$OMP CRITICAL
   ! allocate memory and populate it
   !$OMP END CRITICAL
end do
!$OMP END PARALLEL DO 
end do
!$OMP END DO 
!$OMP END PARALLEL 
...
...
!$OMP PARALLEL private(nOuter)
nOuter = omp_get_num_threads()
!$OMP DO SCHEDULE(static)
do s = 1,nOuter
!$OMP PARALLEL DO SCHEDULE(runtime)
do i = 1 + ((s-1)*N)/nOuter, (s*N)/nOuter
   ! main computations part 1
end do
!$OMP END PARALLEL DO 
end do
!$OMP END DO 
!$OMP END PARALLEL 
...
...
!$OMP PARALLEL private(nOuter)
nOuter = omp_get_num_threads()
!$OMP DO SCHEDULE(static)
do s = 1,nOuter
!$OMP PARALLEL DO SCHEDULE(runtime)
do i = 1 + ((s-1)*N)/nOuter, (s*N)/nOuter
   ! main computations part 2
end do
!$OMP END PARALLEL DO 
end do
!$OMP END DO 
!$OMP END PARALLEL 
...
...

4 Comments

Indeed, the overhead of tasks would be negligible in my case. However, I prefer not to rely on a behavior that is fully implementation-dependent if there are other solutions. Your second solution is exactly in line with my initial question, and it would be worth testing at some point.
@PierU using static_steal is also an implementation-dependent solution ;)
Ah, yes, I didn't notice that it is not part of the OpenMP specification, but rather an extension in the Intel implementation. When not supported, I guess it would revert to the default static scheduling. So the question is rather "which of the task or static_steal options would be the least penalizing if not implemented as expected?", and the answer is problem-dependent.
@PierU Effectively, both work with Intel/LLVM-based OpenMP implementations, so it shouldn't be a big deal. The implementation-specific value is the reason for going through the runtime schedule rather than using the schedule directly.

The specific Intel compiler versions should support "static stealing". To enable it, you need to use schedule(runtime) with the parallel do directive, like so:

!$omp parallel do schedule(runtime)

When running the application, set OMP_SCHEDULE=static_steal as an environment variable before you start the application, e.g., for bash-like shells:

export OMP_SCHEDULE=static_steal

or by setting it only for the invocation:

OMP_SCHEDULE=static_steal ./my-application

The loop is then partitioned statically at first, but when threads run out of work, they can steal from other threads. Does that solve your problem?
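
For illustration, here is a minimal self-contained sketch of the pattern (the program name and the dummy workload are placeholders for your real, imbalanced loop):

program static_steal_demo
   implicit none
   integer, parameter :: N = 10000
   integer :: i
   real(8) :: work(N)

   ! schedule(runtime) defers the schedule choice to OMP_SCHEDULE,
   ! e.g. OMP_SCHEDULE=static_steal with the Intel runtime
   !$omp parallel do schedule(runtime)
   do i = 1, N
      work(i) = sqrt(real(i, 8))   ! stands in for the imbalanced computation
   end do
   !$omp end parallel do

   print *, sum(work)
end program static_steal_demo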

1 Comment

It is indeed a simple solution to my question. In the end, however, I do not get a significant speed-up from binding the threads to the cores, allocating and processing with the same threads, using static_steal, etc... But this is unrelated to static_steal itself; maybe the memory accesses are simply far from being a bottleneck in my code. At least I have learnt about static_steal, which could be useful in other codes of mine anyway.
