
I am learning OpenMP and came across this issue. Consider the following code, where:

  1. Arrays a, b, and c are initialized
#pragma omp parallel num_threads(4)
{
    #pragma omp for schedule(static, 64)
    for(int i = 0; i < 256; i++)
    {
        d[i] = a[i];
        if( i + 1 < 256 )
            d[i+1] = b[i];
        if( i + 2 < 256 )
            d[i+2] = c[i];
    }
}

While running this code multiple times, the following observations were noted:

  1. Random results (sometimes correct, sometimes incorrect) are seen
  2. Incorrect values are assigned to d at index i=64 or i=128

Considering the number of threads (4) and the number of iterations (256), I applied schedule(static, 64) with a chunk size of 64. I assumed that static scheduling would make the threads run in sequence. I even applied #pragma omp critical over the assignments, but that didn't work and the random behavior persisted.

3 Comments
  • Please consider using OMP_NUM_THREADS instead of a hard-coded num_threads(4), which is bad practice (see the sketch after these comments). Threads do not run in sequence; otherwise it would not be parallel in the first place. Here, the schedule defines which thread computes which items (i.e. partitioning). Commented Oct 31, 2024 at 9:18
  • I am afraid this loop is mostly sequential because of the accesses to d. And even if that were not a problem, a loop copying 256 items is cheap (assuming the copied objects are cheap), while creating threads and distributing the work is certainly more expensive than running this sequentially... Why do you want to parallelize such a loop in the first place? Commented Oct 31, 2024 at 9:19
  • Thanks for the suggestion to use OMP_NUM_THREADS. Since I am learning OpenMP, the loop size is not an issue here; I just wanted to clarify the concepts with small examples. Commented Oct 31, 2024 at 10:27
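
A minimal sketch of what the first comment suggests (placeholder arrays, printing the team size): drop the hard-coded num_threads(4) and let the thread count come from the environment at run time.

#include <stdio.h>
#include <omp.h>

#define N 256

int main(void)
{
    int a[N], d[N];
    for (int i = 0; i < N; i++)
        a[i] = i;                      /* placeholder initialization */

    /* No num_threads clause: the team size is taken from OMP_NUM_THREADS
       (or the implementation default) when the parallel region starts. */
    #pragma omp parallel
    {
        #pragma omp single
        printf("running with %d threads\n", omp_get_num_threads());

        #pragma omp for schedule(static, 64)
        for (int i = 0; i < N; i++)
            d[i] = a[i];
    }

    printf("d[255] = %d\n", d[255]);
    return 0;
}

Compile with gcc -fopenmp and run it as, for example, OMP_NUM_THREADS=4 ./a.out; changing the environment variable changes the team size without recompiling.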

2 Answers


Your schedule(static, 64) only says that the 256 iterations are distributed in chunks of 64, which are assigned to the threads in the team in a round-robin fashion in the order of the thread number (OpenMP API 11.5.3, scheduleClause); it does not guarantee that the threads execute their chunks in any fixed order relative to each other.
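
A small illustrative sketch (using the question's loop bounds) shows both halves of that statement: the chunk-to-thread mapping is fixed by schedule(static, 64), but the order in which the chunks actually run is not, so the print order changes from run to run.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for schedule(static, 64)
        for (int i = 0; i < 256; i++)
        {
            /* report once per chunk: i is a multiple of the chunk size */
            if (i % 64 == 0)
                printf("chunk starting at i=%3d runs on thread %d\n",
                       i, omp_get_thread_num());
        }
    }
    return 0;
}

Every run maps the chunk starting at 64*t to thread t, but the four lines appear in a different order from run to run.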

So, you have a race condition + data race (thx Joachim). And you will still have these problems even if you add

        #pragma omp critical
        d[i] = a[i];
        if( i + 1 < 256 )
        {
            #pragma omp critical
            d[i+1] = b[i];
        }
        if( i + 2 < 256 )
        {
            #pragma omp critical
            d[i+2] = c[i];
        }

It means only one thread at a time is allowed to execute one of these 3 assignments. But consider that Thread0, doing iteration 63 (the last iteration of its chunk), is on the line d[i+1] = b[i]; while Thread1, doing iteration 64 (the first iteration of the next chunk), is on the line d[i] = a[i];. Both write d[64], in whichever order they happen to enter the critical section. Still a race condition, and that is exactly why the incorrect values show up at the chunk boundaries i=64 and i=128.


Comments

It's not only a race condition, it is also still a data race (which makes the behavior undefined). critical, the Bazooka of synchronization, is not even necessary; an atomic write would be sufficient. But the d[i] = a[i] assignment would also need the atomic write directive; a sketch of this is shown after these comments. This would fix the data race, as atomic writes synchronize sufficiently to avoid a data race. The race condition persists, because the atomic write does not order the writes to the same element, so each element of d can still randomly have the value of a[i], b[i-1], or c[i-2].
Yep true, added this detail.
Both #omp critical and #omp atomic will make the loop so slow that it would not be worth using multiple threads (certainly even on many-core CPUs). Atomics are nothing more than locks at cache-line granularity on all mainstream CPUs, with a huge additional overhead compared to classical accesses (generally dependent on the number of cores, so atomics on the same small portion of memory never scale).
Thanks for the reply mate... I read the document, which stated that using schedule(dynamic, chunk_size) won't ensure the sequence in which the threads run, whereas using static will make the threads run in sequence starting from Thread0, Thread1, Thread2, and so on.
Mhmm ok, my bad, I found the following clause in the OpenMP API 5.2 document: "chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number" - so this Microsoft example is legit. Anyway, even with a deterministic workload split, you still have a race :)
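
A minimal sketch of what this comment thread describes, assuming the same int arrays as in the question (a rewrite of the loop, not code from the answer): every store to d becomes an atomic write, which removes the data race (the undefined behavior) but not the race condition, so d[i] can still end up holding a[i], b[i-1], or c[i-2].

#pragma omp parallel num_threads(4)
{
    #pragma omp for schedule(static, 64)
    for (int i = 0; i < 256; i++)
    {
        /* atomic writes: no data race, but the write order is still arbitrary */
        #pragma omp atomic write
        d[i] = a[i];

        if (i + 1 < 256)
        {
            #pragma omp atomic write
            d[i + 1] = b[i];
        }
        if (i + 2 < 256)
        {
            #pragma omp atomic write
            d[i + 2] = c[i];
        }
    }
}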

By adding OpenMP directives, you assert that the code will be free of data races when executed in parallel. This code, however, has a loop-carried dependence, which results in a data race during parallel execution. You can easily transform the loop into a dependence-free (and therefore data-race-free) loop:

#pragma omp parallel num_threads(4)
{
    #pragma omp for schedule(static, 64)
    for(int i = 0; i < 256; i++)
    {
        d[i] = a[i];
        /* apply i:=i-1 to move the access to d into the same iteration */
        if( i < 256 && i - 1 >= 0 )
            d[i] = b[i - 1];
        /* apply i:=i-2 to move the access to d into the same iteration */
        if( i < 256 && i - 2 >= 0 )
            d[i] = c[i - 2];
    }
}

The comparison for the lower bound is necessary, as the initial code implicitly assumes i >= 0 based on the iteration space.

This transformation resolves the data race in the code without adding further synchronization. Compilers should be able to apply such a transformation automatically, and would actually perform it in order to vectorize the code. But since, by using OpenMP, you assert that such a transformation is not necessary, the compiler will not apply it.
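
As a quick check of that claim (a sketch with placeholder int arrays, keeping the transformed loop body exactly as above): the transformed loop now produces the same d on every parallel run, identical to executing the same body sequentially.

#include <stdio.h>
#include <string.h>

#define N 256

int main(void)
{
    int a[N], b[N], c[N], d[N], ref[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 1000 + i; c[i] = 2000 + i; }

    /* reference result: the transformed body, executed sequentially */
    for (int i = 0; i < N; i++)
    {
        ref[i] = a[i];
        if (i < N && i - 1 >= 0) ref[i] = b[i - 1];
        if (i < N && i - 2 >= 0) ref[i] = c[i - 2];
    }

    #pragma omp parallel num_threads(4)
    {
        #pragma omp for schedule(static, 64)
        for (int i = 0; i < N; i++)
        {
            d[i] = a[i];
            if (i < N && i - 1 >= 0) d[i] = b[i - 1];
            if (i < N && i - 2 >= 0) d[i] = c[i - 2];
        }
    }

    puts(memcmp(d, ref, sizeof d) == 0 ? "deterministic: parallel matches sequential"
                                       : "MISMATCH");
    return 0;
}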

