
The whole program has been shrunk to a simple test:

    const int loops = 1e10;
    int j[4] = { 1, 2, 3, 4 };
    time_t time = std::time(nullptr);
    for (int i = 0; i < loops; i++) j[i % 4] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

    int k[4] = { 1, 2, 3, 4 };
    omp_set_num_threads(4);
    time = std::time(nullptr);
#pragma omp parallel for
    for (int i = 0; i < loops; i++) k[omp_get_thread_num()] += 2;
    std::cout << std::time(nullptr) - time << std::endl;

In the first case it takes about 3 seconds to run through the loop; in the second case the result is inconsistent and may take 4-9 seconds. Both loops run faster with some optimizations enabled (such as favouring speed and whole-program optimization), but the second loop is still significantly slower. I tried adding a barrier at the end of the loop and explicitly specifying the array as shared, but that didn't help. The only way I managed to make the parallel loop run faster was to make the loop bodies empty. What may be the problem?

Windows 10 x64, CPU Intel Core i5 10300H (4 cores)

  • The measuring technique is always suspect number one. Commented Apr 6, 2022 at 13:24
  • @Bathsheba it is present in the code, what's wrong with it? Commented Apr 6, 2022 at 13:25
  • A few suspects as pointers: i) 1e10 is out of int range, ii) compilers often optimize out such loops, so you need to use volatile int, iii) the measuring technique does not reflect actual CPU time. Commented Apr 6, 2022 at 13:28
  • The k[omp_get_thread_num()] in your code is causing false sharing, as the cache line that contains the k array is sent back and forth between the caches of the system. If you change the definition of the array to be 4*16 integers and use k[omp_get_thread_num() * 16] instead, the problem should be gone (see the sketch after these comments). Commented Apr 6, 2022 at 13:31
  • A typical cache line size is 64 bytes. Commented Apr 6, 2022 at 14:05
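
To make the padding suggestion from the comments concrete, here is a minimal sketch of the question's second loop with each thread's counter placed on its own cache line. It is my own illustration, assuming 64-byte cache lines, 4-byte int and 4 threads; the start variable and the empty initializer are mine, and the overflow and result-equivalence issues raised in the answer below still apply.

    #include <ctime>
    #include <iostream>
    #include <omp.h>

    int main() {
        const long long loops = 1e10;              // long long: 1e10 does not fit in an int
        int k[4 * 16] = {};                        // thread t uses k[t * 16]; slots are 64 bytes apart
        omp_set_num_threads(4);
        std::time_t start = std::time(nullptr);
        #pragma omp parallel for
        for (long long i = 0; i < loops; i++)
            k[omp_get_thread_num() * 16] += 2;     // each thread now owns a whole cache line
        std::cout << std::time(nullptr) - start << std::endl;
        // note: the int counters still overflow, as the answer points out;
        // this sketch only targets the false sharing
    }

Using alignas(64) on a small per-thread struct would achieve the same separation more explicitly.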

1 Answer


As already pointed out in the various comments, the crux of your problem is false sharing. Indeed, your example is a typical case where one can observe it. However, there are also quite a few issues in your code, such as:

  • You will likely see overflows, both in your loops variable and in all of your j and k arrays;
  • Your timer isn't really the best choice ever (admittedly, that is a bit pedantic on my part in this case);
  • You do not make use of the values you compute, which allows the compiler to optimize the various computations away entirely;
  • Your two loops are not equivalent and won't give the same results. To get matching results, I went back to your original i % 4 indexing and added a schedule( static, 1 ) clause. This is not the proper way of doing it, but it was only there to reproduce the expected results without yet using the correct reduction clause.

Then I rewrote your example and also augmented it with what I believe is a better solution to the false sharing issue: using a reduction clause.

#include <iostream>
#include <omp.h>

int main() {
    const long long loops = 1e10;
    long long j[4] = { 1, 2, 3, 4 };
    double time = omp_get_wtime();
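    // sequential baseline: a single thread updates all four counters in turn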
    for ( long long i = 0; i < loops; i++ ) {
         j[i % 4] += 2;
    }
    std::cout << "sequential: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long k[4] = { 1, 2, 3, 4 };
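    // all four elements of k sit in the same cache line; with schedule( static, 1 )
    // each thread only ever writes its own element, yet the line keeps bouncing
    // between the cores' caches (false sharing)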
    #pragma omp parallel for num_threads( 4 ) schedule( static, 1 )
    for ( long long i = 0; i < loops; i++ ) {
        k[i%4] += 2;
    }
    std::cout << "false sharing: " << omp_get_wtime() - time << std::endl;

    time = omp_get_wtime();
    long long l[4] = { 1, 2, 3, 4 };
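    // the reduction clause gives each thread a private copy of l that is
    // merged once at the end, so nothing is shared during the loop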
    #pragma omp parallel for num_threads( 4 ) reduction( +: l[0:4] )
    for ( long long i = 0; i < loops; i++ ) {
        l[i%4] += 2;
    }
    std::cout << "reduction: " << omp_get_wtime() - time << std::endl;

    bool a = j[0]==k[0] && j[1]==k[1] && j[2]==k[2] && j[3]==k[3];
    bool b = j[0]==l[0] && j[1]==l[1] && j[2]==l[2] && j[3]==l[3];
    std::cout << "sanity check: " << a << " " << b << std::endl;

    return 0;
}

Compiling and running without optimizations gives the following on my laptop:

$ g++ -O0 -fopenmp false.cc 
$ ./a.out 
sequential: 15.5384
false sharing: 47.1417
reduction: 4.7565
sanity check: 1 1

This speaks for itself with regard to the improvement the reduction clause brings. Now, enabling compiler optimizations gives a more nuanced picture:

$ g++ -O3 -fopenmp false.cc 
$ ./a.out 
sequential: 4.8414
false sharing: 4.10714
reduction: 2.10953
sanity check: 1 1

If anything, this shows that compilers are quite good at mitigating most false sharing nowadays. Indeed, with your initial (erroneous) k[omp_get_thread_num()] indexing, there was no time difference with or without the reduction clause, showing that the compiler was able to avoid the issue on its own.
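
For illustration, here is a sketch (my own, not part of the original answer) of what that transformation amounts to when done by hand: each thread accumulates into a private local variable and touches the shared array only once, so the cache line holding the results does not ping-pong during the loop. It reuses loops and the headers from the program above; the names m and local are mine.

// manual counterpart of what the optimizer can do with k[omp_get_thread_num()]:
// keep the running sum in a private variable and write to the shared array once
long long m[4] = { 1, 2, 3, 4 };
#pragma omp parallel num_threads( 4 )
{
    long long local = 0;                      // private to each thread
    #pragma omp for
    for ( long long i = 0; i < loops; i++ ) {
        local += 2;
    }
    m[omp_get_thread_num()] += local;         // one shared write per thread
}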


3 Comments

Thanks for the answer! What about using schedule( static, 8 )? Theoretically, each thread then works on a contiguous 64-byte block of memory (presuming long long is 8 bytes; I ignore the fact that its size may vary, for the sake of simplicity). Should that work? I'm asking because in my case it didn't.
The schedule( static, 1 ) I added has nothing to do with preventing false sharing. It was there only to ensure that each thread accessed only the element of the array k corresponding to its own rank. Without it, the default schedule would have been (with my compiler) a simple static one, with contiguous chunks of indices of size loops/4. Then each thread would have accessed all possible indices of k for incrementing, and one would need to protect these accesses between threads with either atomic or critical, which essentially re-serializes the parallel loop (sketched below).
In your case, unless your actual code differs significantly from your snippet, the reduction clause is the way to go.
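
For reference, a sketch (my own, reusing loops from the answer's program) of the variant the previous comment describes: with the default contiguous schedule every thread updates every element of k, so each increment has to be protected, and the atomic effectively serializes the loop again.

long long k[4] = { 1, 2, 3, 4 };
#pragma omp parallel for num_threads( 4 )   // default contiguous static schedule
for ( long long i = 0; i < loops; i++ ) {
    #pragma omp atomic                      // several threads update the same k[i % 4]
    k[i % 4] += 2;
}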
