
The whole program has been shrunk to a simple test:

    const int loops = 1e10;
    int j[4] = { 1, 2, 3, 4 };
    time_t time = std::time(nullptr);
    for (int i = 0; i < loops; i++) j[i % 4] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

    int k[4] = { 1, 2, 3, 4 };
    omp_set_num_threads(4);
    time = std::time(nullptr);
#pragma omp parallel for
    for (int i = 0; i < loops; i++) k[omp_get_thread_num()] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

In the first case it takes about 3 seconds to run through the loop; in the second case the result is inconsistent and may take 4-9 seconds. Both loops run faster with some optimizations enabled (such as favouring speed and whole-program optimization), but the second loop is still significantly slower. I tried adding a barrier at the end of the loop and explicitly specifying the array as shared, but that didn't help. The only way I managed to make the parallel loop run faster was to make the loop bodies empty. What may be the problem?

Windows 10 x64, CPU Intel Core i5 10300H (4 cores)

  • The measuring technique is always suspect number one. Commented Apr 6, 2022 at 13:24
  • @Bathsheba it is present in the code, what's wrong with it? Commented Apr 6, 2022 at 13:25
  • A few suspects as pointers: i) 1e10 is out of int range, ii) compilers often optimize out such loops, so you need to use volatile int, iii) the measuring technique does not reflect actual CPU time. Commented Apr 6, 2022 at 13:28
  • The k[omp_get_thread_num()] in your code is causing false sharing, as the cache line that contains the k array is sent back and forth between the caches of the system. If you change the definition of the array to be 4*16 integers and use k[omp_get_thread_num() * 16] instead, the problem should be gone (see the sketch after these comments). Commented Apr 6, 2022 at 13:31
  • A typical cache line size is 64 bytes. Commented Apr 6, 2022 at 14:05
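
To make the padding suggestion from the comments concrete, here is a minimal sketch of the question's second loop with each thread's counter placed on its own cache line. It is my own illustration, assuming 64-byte cache lines, 4-byte int and 4 threads; the start variable and the empty initializer are mine, and the overflow and result-equivalence issues raised in the answer below still apply.

    #include <ctime>
    #include <iostream>
    #include <omp.h>

    int main() {
        const long long loops = 1e10;              // long long: 1e10 does not fit in an int
        int k[4 * 16] = {};                        // thread t uses k[t * 16]; slots are 64 bytes apart
        omp_set_num_threads(4);
        std::time_t start = std::time(nullptr);
        #pragma omp parallel for
        for (long long i = 0; i < loops; i++)
            k[omp_get_thread_num() * 16] += 2;     // each thread now owns a whole cache line
        std::cout << std::time(nullptr) - start << std::endl;
        // note: the int counters still overflow, as the answer points out;
        // this sketch only targets the false sharing
    }

Using alignas(64) on a small per-thread struct would achieve the same separation more explicitly.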

1 Answer


As already pointed out in the various comments, the crux of your problem is false sharing. Indeed, your example is a typical case where one can observe it. However, there are also quite a few issues in your code, such as:

  • You will likely see overflows, both in your loops variable and in all of your j and k arrays;
  • Your timer isn't really the best choice ever (admittedly, that is a bit pedantic on my part in this case);
  • You do not make use of the values you compute, which allows the compiler to optimize the various computations away entirely;
  • Your two loops are not equivalent and won't give the same results. To get matching results, I went back to your original i % 4 indexing and added a schedule( static, 1 ) clause. This is not the proper way of doing it, but it was only there to reproduce the expected results without yet using the correct reduction clause.

Then I rewrote your example and also augmented it with what I believe is a better solution to the false sharing issue: using a reduction clause.

#include <iostream>
#include <omp.h>

int main() {
    const long long loops = 1e10;
    long long j[4] = { 1, 2, 3, 4 };
    double time = omp_get_wtime();
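    // sequential baseline: a single thread updates all four counters in turn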
    for ( long long i = 0; i < loops; i++ ) {
         j[i % 4] += 2;
    }
    std::cout << "sequential: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long k[4] = { 1, 2, 3, 4 };
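    // all four elements of k sit in the same cache line; with schedule( static, 1 )
    // each thread only ever writes its own element, yet the line keeps bouncing
    // between the cores' caches (false sharing)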
    #pragma omp parallel for num_threads( 4 ) schedule( static, 1 )
    for ( long long i = 0; i < loops; i++ ) {
        k[i%4] += 2;
    }
    std::cout << "false sharing: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long l[4] = { 1, 2, 3, 4 };
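    // the reduction clause gives each thread a private copy of l that is
    // merged once at the end, so nothing is shared during the loop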
    #pragma omp parallel for num_threads( 4 ) reduction( +: l[0:4] )
    for ( long long i = 0; i < loops; i++ ) {
        l[i%4] += 2;
    }
    std::cout << "reduction: " << omp_get_wtime() - time << std::endl;

    bool a = j[0]==k[0] && j[1]==k[1] && j[2]==k[2] && j[3]==k[3];
    bool b = j[0]==l[0] && j[1]==l[1] && j[2]==l[2] && j[3]==l[3];
    std::cout << "sanity check: " << a << " " << b << std::endl;

    return 0;
}

Compiling and running without optimizations gives the following on my laptop:

$ g++ -O0 -fopenmp false.cc 
$ ./a.out 
sequential: 15.5384
false sharing: 47.1417
reduction: 4.7565
sanity check: 1 1

This speaks for itself with regard to the improvement the reduction clause brings. Now, enabling compiler optimizations gives a more nuanced picture:

$ g++ -O3 -fopenmp false.cc 
$ ./a.out 
sequential: 4.8414
false sharing: 4.10714
reduction: 2.10953
sanity check: 1 1

If anything, this shows that compilers are quite good at mitigating most false sharing nowadays. Indeed, with your initial (erroneous) k[omp_get_thread_num()] indexing, there was no time difference with or without the reduction clause, showing that the compiler was able to avoid the issue on its own.
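
For illustration, here is a sketch (my own, not part of the original answer) of what that transformation amounts to when done by hand: each thread accumulates into a private local variable and touches the shared array only once, so the cache line holding the results does not ping-pong during the loop. It reuses loops and the headers from the program above; the names m and local are mine.

// manual counterpart of what the optimizer can do with k[omp_get_thread_num()]:
// keep the running sum in a private variable and write to the shared array once
long long m[4] = { 1, 2, 3, 4 };
#pragma omp parallel num_threads( 4 )
{
    long long local = 0;                      // private to each thread
    #pragma omp for
    for ( long long i = 0; i < loops; i++ ) {
        local += 2;
    }
    m[omp_get_thread_num()] += local;         // one shared write per thread
}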


3 Comments

Thanks for the answer! What about using schedule( static, 8 )? Theoretically, each thread then works on a contiguous 64-byte block of memory (presuming long long is 8 bytes; I ignore the fact that its size may vary, for the sake of simplicity). Should that work? I'm asking because in my case it didn't.
The schedule( static, 1 ) I added has nothing to do with preventing false sharing. It was there only to ensure that each thread accessed only the element of the array k corresponding to its own rank. Without it, the default schedule would have been (with my compiler) a simple static one, with contiguous chunks of indices of size loops/4. Then each thread would have accessed all possible indices of k for incrementing, and one would need to protect these accesses between threads with either atomic or critical, which essentially re-serializes the parallel loop (sketched below).
In your case, unless your actual code differs significantly from your snippet, the reduction clause is the way to go.
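
For reference, a sketch (my own, reusing loops from the answer's program) of the variant the previous comment describes: with the default contiguous schedule every thread updates every element of k, so each increment has to be protected, and the atomic effectively serializes the loop again.

long long k[4] = { 1, 2, 3, 4 };
#pragma omp parallel for num_threads( 4 )   // default contiguous static schedule
for ( long long i = 0; i < loops; i++ ) {
    #pragma omp atomic                      // several threads update the same k[i % 4]
    k[i % 4] += 2;
}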
