
I have a simple question about using OpenMP (with C++) that I hoped someone could help me with. I've included a small example below to illustrate my problem.

#include<iostream>
#include<vector>
#include<ctime>
#include<omp.h>

using namespace std;

int main(){
  srand(time(NULL));//Seed random number generator                                                                               

  vector<int>v;//Create vector to hold random numbers in interval [0,9]                                                                                   
  vector<int>d(10,0);//Vector to hold counts of each integer initialized to 0                                                                    

  for(int i=0;i<1e9;++i)
    v.push_back(rand()%10);//Push back random numbers [0,9]                                                                      

  clock_t c=clock();

  #pragma omp parallel for
  for(int i=0;i<v.size();++i)
    d[v[i]]+=1;//Count number stored at v[i]                                                                                     

  cout<<"Seconds: "<<(clock()-c)/CLOCKS_PER_SEC<<endl;

  for(vector<int>::iterator i=d.begin();i!=d.end();++i)
  cout<<*i<<endl;

  return 0;
}

The above code creates a vector v that contains 1 billion random integers in the range [0,9]. Then, the code loops through v counting how many instances of each different integer there are (i.e., how many ones are found in v, how many twos, etc.).

Each time a particular integer is encountered, it is counted by incrementing the appropriate element of a vector d. So, d[0] counts how many zeroes, d[6] counts how many sixes, and so on. Make sense so far?

My problem arises when I try to make the counting loop parallel. Without the OpenMP #pragma, my code takes 20 seconds, yet with the pragma it takes over 60 seconds.

Clearly, I've misunderstood some concept relating to OpenMP (perhaps how data is shared/accessed?). Could someone explain my error please or point me in the direction of some insightful literature with appropriate keywords to help my search?

  • Your approach to paralyzing your algorithm is wrong. With your approach, two threads can access the same index at the same time. Commented Jul 25, 2012 at 15:13
  • No repro on the time issue, but your code clearly has a race condition. Commented Jul 25, 2012 at 15:13
  • As @IanMedeiros points out, many people confuse the words 'parallelising' and 'paralysing'. A data race will generally paralyse a parallelised program. Repeat 30 times. Commented Jul 25, 2012 at 15:29
  • @IanMedeiros can you explain the difference between them to me? Commented Jul 10, 2016 at 10:58

2 Answers


Your code exhibits:

  • race conditions due to unsynchronised access to a shared variable
  • false and true sharing cache problems
  • wrong measurement of run time

Race conditions arise because you are concurrently updating the same elements of vector d in multiple threads. Comment out the srand() line and run your code several times with the same number of threads (but with more than one thread). Compare the outputs from different runs.

False sharing occurs when two threads write to memory locations that lie close enough to each other to fall on the same cache line. That cache line then constantly bounces from core to core (or from CPU to CPU in multi-socket systems), generating an excess of cache coherency traffic. With 32 bytes per cache line, 8 elements of the vector fit in one cache line; with 64 bytes per cache line, the whole vector d fits in one cache line. This makes the code slow on Core 2 processors and slightly slower (but not nearly as slow as on Core 2) on Nehalem and post-Nehalem (e.g. Sandy Bridge) processors. True sharing occurs at those elements that are accessed by two or more threads at the same time.

You should either put the increment in an OpenMP atomic construct (slow), use an array of OpenMP locks to protect access to the elements of d (faster or slower, depending on your OpenMP runtime), or accumulate local values and then do a final synchronised reduction (fastest). The first option is implemented like this:

#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
  #pragma omp atomic
  d[v[i]]+=1;//Count number stored at v[i]
}

The second is implemented like this:

omp_lock_t locks[10];
for (int i = 0; i < 10; i++)
  omp_init_lock(&locks[i]);

#pragma omp parallel for
for(int i=0;i<v.size();++i)
{
  int vv = v[i];
  omp_set_lock(&locks[vv]);
  d[vv]+=1;//Count number stored at v[i]
  omp_unset_lock(&locks[vv]);
}

for (int i = 0; i < 10; i++)
  omp_destroy_lock(&locks[i]);

(include <omp.h> to get access to the omp_* functions)

I leave it up to you to come up with an implementation of the third option.
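
(For reference, one possible shape of the third option — not from the original answer — is sketched below: each thread accumulates into its own private histogram and then merges it into d under a critical section. The function name and wrapping are purely illustrative.)

#include <omp.h>
#include <vector>

// Sketch of the third option: per-thread private counts, merged once at the end.
std::vector<int> count_digits(const std::vector<int>& v)
{
  std::vector<int> d(10, 0);

  #pragma omp parallel
  {
    std::vector<int> local(10, 0);   // private histogram for this thread

    #pragma omp for nowait
    for (long i = 0; i < (long)v.size(); ++i)
      local[v[i]] += 1;              // no shared writes, hence no locks

    #pragma omp critical
    for (int j = 0; j < 10; ++j)     // one synchronised merge per thread
      d[j] += local[j];
  }
  return d;
}

The nowait clause lets each thread proceed straight to its merge once it has finished its share of the iterations, instead of waiting at the implicit barrier of the for construct.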

You are measuring elapsed time using clock(), but clock() measures CPU time, not wall-clock time. If you have one thread running at 100% CPU usage for 1 second, then clock() will indicate an increase in CPU time of 1 second. If you have 8 threads running at 100% CPU usage for 1 second, clock() will indicate an increase in CPU time of 8 seconds (that is, 8 threads times 1 CPU second per thread). Use omp_get_wtime() or gettimeofday() (or some other high-resolution timer API) instead.
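
(As an illustration — not part of the original answer — the timing in the question could be rewritten along these lines, with the counting loop omitted:)

#include <omp.h>
#include <iostream>

int main()
{
  double t0 = omp_get_wtime();   // wall-clock time before the loop

  // ... the (properly synchronised) parallel counting loop goes here ...

  double t1 = omp_get_wtime();   // wall-clock time after the loop
  std::cout << "Seconds: " << (t1 - t0) << std::endl;
  return 0;
}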


7 Comments

I would assume OpenMP atomics map directly to the corresponding assembly instructions, so assuming the hardware supports atomic increment of ints (like your average x86), how could that possibly be slower than using mutexes?
Unfortunately GCC (even 4.7.0) disagrees with your assumptions and inserts function calls to omp_set_lock() and omp_unset_lock(), even at -O3. These function calls are in the shared PIC of libgomp.so and result in further jumps...
On the other hand, the proposed solution with #pragma omp atomic is translated to lock addl $1, (%rax,%rdx,4) (%rax holds the base address of the data array, %rdx holds the value of vv).
You are saying #pragma omp atomics uses omp_set_lock and is also translated to lock add ...? That makes no sense at all. From testing it should be transformed into the latter (so just one lock add or lock inc), so once again, why would that be slower than using omp_set_lock/omp_unset_lock?
Sorry, must have misread your comment. It's OpenMP atomic construct, not atomics, and indeed it gets translated to a locked increment only in this case. But that is still an implementation detail and atomic statements could be implemented in a completely different way.

EDIT: Once your race condition is resolved via correct synchronization, the following paragraph applies; before that, your data races unfortunately make any speed comparison moot:

Your program slows down because there are only 10 possible output elements in the parallel section and all threads access them randomly. OpenMP cannot access any of those elements without a lock (which you would need to provide via synchronization), and as a result the locking will cause your threads to have a higher overhead than you gain from counting in parallel.

A solution to speed this up is to instead give each OpenMP thread a local counter array that counts the 0-9 values that the particular thread has seen, and then sum those local counts into the master count vector. This is easily parallelized and much faster, as the threads don't need to lock on a shared write vector. I would expect close to an Nx speed-up, where N is the number of OpenMP threads, as there should be very limited locking required. This solution also avoids a lot of the race conditions currently in your code.
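
(As an illustration — not from the original answer — OpenMP 4.5 and newer can express this per-thread accumulation directly with an array-section reduction, which gives every thread a private copy of the counters and sums them automatically after the loop. The function name below is hypothetical:)

#include <omp.h>
#include <vector>

// Sketch: an OpenMP 4.5 array-section reduction over the 10 counters.
std::vector<int> count_with_reduction(const std::vector<int>& v)
{
  std::vector<int> d(10, 0);
  int* counts = d.data();   // raw pointer required for the reduction clause

  #pragma omp parallel for reduction(+ : counts[0:10])
  for (long i = 0; i < (long)v.size(); ++i)
    counts[v[i]] += 1;      // each thread updates its own private copy

  return d;
}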

See http://software.intel.com/en-us/articles/use-thread-local-storage-to-reduce-synchronization/ for more details on using thread-local storage with OpenMP.

5 Comments

-1 The proposed solution is fine, but the explanation of why it slows down is wrong. OpenMP does not synchronise access to shared variables by default - one has to do it explicitly.
@HristoIliev Ah yes, I missed that he didn't synchronize and focused more on the overall solution to the issue before addressing conflicts -- I added an edit to my answer to reflect this
"OpenMP cannot access ..." On the contrary - it can. That's why there are explicit synchronisation mechanisms provided by OpenMP.
Yes, this is true. I originally missed the lack of synchronization. I updated my answer to state this is only true after synchronization is correctly in place.
"make a local copy for each omp thread" will consume much more mem if the variable is big, right
