
I have a simple question about using OpenMP (with C++) that I hoped someone could help me with. I've included a small example below to illustrate my problem.

#include<iostream>
#include<vector>
#include<ctime>
#include<omp.h>

using namespace std;

int main(){
  srand(time(NULL));//Seed random number generator                                                                               

  vector<int>v;//Create vector to hold random numbers in interval [0,9]                                                                                   
  vector<int>d(10,0);//Vector to hold counts of each integer initialized to 0                                                                    

  for(int i=0;i<1e9;++i)
    v.push_back(rand()%10);//Push back random numbers [0,9]                                                                      

  clock_t c=clock();

  #pragma omp parallel for
  for(int i=0;i<v.size();++i)
    d[v[i]]+=1;//Count number stored at v[i]                                                                                     

  cout<<"Seconds: "<<(clock()-c)/CLOCKS_PER_SEC<<endl;

  for(vector<int>::iterator i=d.begin();i!=d.end();++i)
  cout<<*i<<endl;

  return 0;
}

The above code creates a vector v that contains 1 billion random integers in the range [0,9]. Then, the code loops through v counting how many instances of each different integer there are (i.e., how many ones are found in v, how many twos, etc.).

Each time a particular integer is encountered, it is counted by incrementing the appropriate element of a vector d. So, d[0] counts how many zeroes, d[6] counts how many sixes, and so on. Make sense so far?

My problem arises when I try to make the counting loop parallel. Without the OpenMP #pragma, my code takes 20 seconds, yet with the pragma it takes over 60 seconds.

Clearly, I've misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone explain my error please or point me in the direction of some insightful literature with appropriate keywords to help my search?

  • Your approach to paralyzing your algorithm is wrong. With your approach, two threads can access the same index at the same time. Commented Jul 25, 2012 at 15:13
  • No repro on the time issue, but your code clearly has a race condition. Commented Jul 25, 2012 at 15:13
  • As @IanMedeiros points out, many people confuse the words 'parallelising' and 'paralysing'. A data race will generally paralyse a parallelised program. Repeat 30 times. Commented Jul 25, 2012 at 15:29
  • @IanMedeiros can you explain the difference between them to me? Commented Jul 10, 2016 at 10:58

2 Answers


Your code exhibits:

  • race conditions due to unsynchronised access to a shared variable
  • false and true sharing cache problems
  • wrong measurement of run time

Race conditions arise because you are concurrently updating the same elements of vector d in multiple threads. Comment out the srand() line and run your code several times with the same number of threads (but with more than one thread). Compare the outputs from different runs.

False sharing occurs when two threads write to memory locations that lie close enough to each other to fall on the same cache line. That cache line then constantly bounces from core to core (or from CPU to CPU in multi-socket systems), generating an excess of cache coherency traffic. With 32 bytes per cache line, 8 elements of the vector fit in one cache line; with 64 bytes per cache line, the whole vector d fits in one cache line. This makes the code slow on Core 2 processors and slightly slower (but not nearly as slow as on Core 2) on Nehalem and post-Nehalem (e.g. Sandy Bridge) processors. True sharing occurs at those elements that are accessed by two or more threads at the same time.

You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to the elements of d (faster or slower, depending on your OpenMP runtime), or accumulate local values and then do a final synchronised reduction (fastest). The first option is implemented like this:

#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
  #pragma omp atomic
  d[v[i]]+=1;//Count number stored at v[i]
}

The second is implemented like this:

omp_lock_t locks[10];
for (int i = 0; i < 10; i++)
  omp_init_lock(&locks[i]);

#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
  int vv = v[i];
  omp_set_lock(&locks[vv]);
  d[vv]+=1;//Count number stored at v[i]
  omp_unset_lock(&locks[vv]);
}

for (int i = 0; i < 10; i++)
  omp_destroy_lock(&locks[i]);

(include <omp.h> to get access to the omp_* functions)

I leave it up to you to come up with an implementation of the third option.
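
(For reference, one possible shape of the third option — not from the original answer — is sketched below: each thread accumulates into its own private histogram and then merges it into d under a critical section. The function name and wrapping are purely illustrative.)

#include <omp.h>
#include <vector>

// Sketch of the third option: per-thread private counts, merged once at the end.
std::vector<int> count_digits(const std::vector<int>& v)
{
  std::vector<int> d(10, 0);

  #pragma omp parallel
  {
    std::vector<int> local(10, 0);   // private histogram for this thread

    #pragma omp for nowait
    for (long i = 0; i < (long)v.size(); ++i)
      local[v[i]] += 1;              // no shared writes, hence no locks

    #pragma omp critical
    for (int j = 0; j < 10; ++j)     // one synchronised merge per thread
      d[j] += local[j];
  }
  return d;
}

The nowait clause lets each thread proceed straight to its merge once it has finished its share of the iterations, instead of waiting at the implicit barrier of the for construct.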

You are measuring elapsed time using clock(), but clock() measures CPU time, not wall-clock time. If you have one thread running at 100% CPU usage for 1 second, then clock() will indicate an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() will indicate an increase in CPU time of 8 seconds (that is, 8 threads times 1 CPU second per thread). Use omp_get_wtime() or gettimeofday() (or some other high-resolution timer API) instead.
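
(As an illustration — not part of the original answer — the timing in the question could be rewritten along these lines, with the counting loop omitted:)

#include <omp.h>
#include <iostream>

int main()
{
  double t0 = omp_get_wtime();   // wall-clock time before the loop

  // ... the (properly synchronised) parallel counting loop goes here ...

  double t1 = omp_get_wtime();   // wall-clock time after the loop
  std::cout << "Seconds: " << (t1 - t0) << std::endl;
  return 0;
}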


7 Comments

I would assume OpenMP atomics map directly to the corresponding assembly instructions, so assuming the hardware supports atomic increment of ints (like your average x86), how could that possibly be slower than using mutexes?
Unfortunately GCC (even 4.7.0) disagrees with your assumptions and inserts function calls to omp_set_lock() and omp_unset_lock(), even at -O3. These function calls are in the shared PIC of libgomp.so and result in further jumps...
On the other hand, the proposed solution with #pragma omp atomic is translated to lock addl $1, (%rax,%rdx,4) (%rax holds the base address of the data array, %rdx holds the value of vv).
You are saying #pragma omp atomics uses omp_set_lock and is also translated to lock add ...? That makes no sense at all. From testing it should be transformed into the latter (so just one lock add or lock inc), so once again, why would that be slower than using omp_set_lock/omp_unset_lock?
Sorry, must have misread your comment. It's OpenMP atomic construct, not atomics, and indeed it gets translated to a locked increment only in this case. But that is still an implementation detail and atomic statements could be implemented in a completely different way.

EDIT: Once your race condition is resolved via correct synchronization, the following paragraph applies; before that, your data races unfortunately make any speed comparison moot:

Your program slows down because there are only 10 possible output elements in the parallel section and all threads access them randomly. OpenMP cannot access any of those elements without a lock (which you would need to provide via synchronization), and as a result the locking will cause your threads to have a higher overhead than you gain from counting in parallel.

A solution to speed this up is to instead give each OpenMP thread a local counter array that counts the 0-9 values that the particular thread has seen, and then sum those local counts into the master count vector. This is easily parallelized and much faster, as the threads don't need to lock on a shared write vector. I would expect close to an Nx speed-up, where N is the number of OpenMP threads, as there should be very limited locking required. This solution also avoids a lot of the race conditions currently in your code.
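
(As an illustration — not from the original answer — OpenMP 4.5 and newer can express this per-thread accumulation directly with an array-section reduction, which gives every thread a private copy of the counters and sums them automatically after the loop. The function name below is hypothetical:)

#include <omp.h>
#include <vector>

// Sketch: an OpenMP 4.5 array-section reduction over the 10 counters.
std::vector<int> count_with_reduction(const std::vector<int>& v)
{
  std::vector<int> d(10, 0);
  int* counts = d.data();   // raw pointer required for the reduction clause

  #pragma omp parallel for reduction(+ : counts[0:10])
  for (long i = 0; i < (long)v.size(); ++i)
    counts[v[i]] += 1;      // each thread updates its own private copy

  return d;
}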

See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more details on using thread-local storage with OpenMP.

5 Comments

-1 The proposed solution is fine, but the explanation of why it slows down is wrong. OpenMP does not synchronise access to shared variables by default - one has to do it explicitly.
@HristoIliev Ah yes, I missed that he didn't synchronize and focused more on the overall solution to the issue before addressing conflicts -- I added an edit to my answer to reflect this
"OpenMP cannot access ..." On the contrary - it can. That's why there are explicit synchronisation mechanisms provided by OpenMP.
Yes, this is true. I originally missed the lack of synchronization. I updated my answer to state this is only true after synchronization is correctly in place.
"make a local copy for each omp thread" will consume much more mem if the variable is big, right
