
I'm new here and a beginner-level C programmer. I'm having some trouble using OpenMP to speed up a for-loop. Below is a simple example:

#include <stdlib.h>
#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>   /* for gsl_ran_gamma_mt */
#include <omp.h>

gsl_rng *rng;

int main(void)
{
  int i, M = 100000000;
  double tmp;

  /* initialize RNG */
  gsl_rng_env_setup();
  rng = gsl_rng_alloc(gsl_rng_taus);
  gsl_rng_set(rng, (unsigned long int)791526599);

  // option 1: parallel
  #pragma omp parallel for default(shared) private(i, tmp) schedule(dynamic)
  for (i = 0; i < M; i++) {
    tmp = gsl_ran_gamma_mt(rng, 4, 1./3);
  }

  // option 2: sequential
  for (i = 0; i < M; i++) {
    tmp = gsl_ran_gamma_mt(rng, 4, 1./3);
  }

  gsl_rng_free(rng);
  return 0;
}

The code draws from a gamma distribution for M iterations. It turns out the parallel approach with OpenMP (option 1) takes about 1 minute, while the sequential approach (option 2) takes only 20 seconds. While running with OpenMP, I can see a CPU usage of 800% (the server I'm using has 8 CPUs). The system is Linux with GCC 4.1.3, and the compile command I'm using is gcc -fopenmp -lgsl -lgslcblas -lm (I'm using GSL).

Am I doing something wrong? Please help me! Thanks!

P.S. As pointed out by some users, it might be caused by the rng. But even if I replace

tmp=gsl_ran_gamma_mt(rng, 4, 1./3 );

by, say,

tmp=1000*10000;

the problem is still there...

  • You should not make your loop variable i private - OpenMP will take care of that. I don't know if this affects the execution, but you should fix it and retest. Commented Aug 23, 2012 at 16:42
  • Also, note that tmp=1000*10000 probably gets optimized away by the compiler to a noop, so that will skew your timing (see the sketch after these comments). Commented Aug 23, 2012 at 16:43
  • Are you sure there are actually 8 CPUs? Could it be a quad-core with hyperthreading? Commented Aug 23, 2012 at 16:54
  • I'm not sure whether there are actually 8 CPUs; you could be right. I just typed cat /proc/cpuinfo and it shows 8. Commented Aug 23, 2012 at 18:50
  • I found out that if I get rid of schedule(dynamic), the problem is gone! I wonder why? Commented Aug 23, 2012 at 18:57
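On the dead-code point above: a minimal sketch of one way to keep the compiler from discarding the loop body when benchmarking, by accumulating into a value that is actually used. This is only an illustration in the context of the question's code (it reuses i, M, and the stdio.h include from there); the reduction clause makes the parallel sum correct:

/* consume the result of every iteration so the loop
   body cannot be optimized away as dead code */
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < M; i++)
  sum += i * 0.5;
printf("sum = %f\n", sum);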

3 Answers


gsl_ran_gamma_mt probably locks on rng to prevent concurrency issues (if it didn’t, your parallel code probably contains a race condition and thus yields wrong results). The solution then would be to have a separate rng instance for each thread, thus avoiding locking.
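As an illustration, here is a minimal sketch of that idea under the same assumptions as the question (GSL, OpenMP, the Tausworthe generator). The array name rngs and the seeding scheme (base seed plus thread number) are just simple choices for the sketch, not a claim about statistical quality of the streams:

#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <omp.h>

int main(void)
{
  int i, M = 100000000;
  int nthreads = omp_get_max_threads();

  /* one independent generator per thread */
  gsl_rng **rngs = malloc(nthreads * sizeof(gsl_rng *));
  for (i = 0; i < nthreads; i++) {
    rngs[i] = gsl_rng_alloc(gsl_rng_taus);
    gsl_rng_set(rngs[i], 791526599 + i);  /* distinct seed per thread */
  }

  #pragma omp parallel for schedule(static)
  for (i = 0; i < M; i++) {
    double tmp = gsl_ran_gamma_mt(rngs[omp_get_thread_num()], 4, 1./3);
    (void)tmp;  /* silence unused-variable warnings */
  }

  for (i = 0; i < nthreads; i++)
    gsl_rng_free(rngs[i]);
  free(rngs);
  return 0;
}

Each thread indexes the array with its own thread number, so no two threads ever touch the same generator state.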


4 Comments

  • Thanks a lot! But the problem still remains if I replace tmp=gsl_ran_gamma_mt(rng, 4, 1./3 ) with, say, tmp=1000*10000.
  • @user1620200 Odd, there should be a drastic drop in run time. Of course, there's still a huge overhead for parallel computation involved, since the actual computation is quite fast even with a lot of iterations. The dynamic schedule doesn't help (try a static one; there's no reason for a dynamic schedule here, since the tasks all have the same size). Finally, try Stephane's advice, but I'd be surprised if that were the issue, since this should be trivially optimised.
  • Thanks. I found out that if I get rid of schedule(dynamic), the problem is gone! I wonder why?
  • @user1620200 Because it's slooow. A dynamic schedule introduces quite a lot of overhead on its own, because OpenMP does some heavy lifting behind the scenes to distribute work to the cores. It's only suitable if the individual computations take drastically different amounts of time.

Your rng variable is shared, so the threads are spending all their time waiting to be able to use the random number generator. Give each thread a separate instance of the RNG. This will probably mean making the RNG initialization code run in parallel as well.
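For example, here is a sketch of one way to do that, allocating each generator inside the parallel region so it is a thread-private local and initialization itself runs in parallel (my_rng is a made-up name, and the seed offset is again just one simple choice):

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <omp.h>

int main(void)
{
  const int M = 100000000;

  #pragma omp parallel
  {
    /* allocation, seeding, and freeing all happen per thread */
    gsl_rng *my_rng = gsl_rng_alloc(gsl_rng_taus);
    gsl_rng_set(my_rng, 791526599 + omp_get_thread_num());

    int i;
    #pragma omp for schedule(static)
    for (i = 0; i < M; i++) {
      double tmp = gsl_ran_gamma_mt(my_rng, 4, 1./3);
      (void)tmp;
    }

    gsl_rng_free(my_rng);
  }
  return 0;
}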

1 Comment

  • Thanks a lot! But the problem still remains if I replace tmp=gsl_ran_gamma_mt(rng, 4, 1./3 ) with, say, tmp=1000*10000;

Again thanks everyone for helping. I just found out that if I get rid of

schedule(dynamic)

in the code, the problem disappears. But why is that?

2 Comments

  • Dynamic scheduling is quite expensive and should only be used in cases where iterations take a considerable amount of time to complete, but that time can vary greatly from one iteration to the next (see the sketch after these comments). guided is even more expensive, as it makes the iteration blocks smaller and smaller. Use static scheduling for iterations with a fixed amount of compute time.
  • I think you can accept your own answer, since you were the one to solve the issue.
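To make the scheduling point concrete, here is a sketch of the question's loop with the clauses spelled out. This still shares the single rng from the question, so it only illustrates the scheduling behavior, not the per-thread-RNG fix, and the chunk size 100000 is an arbitrary value chosen for illustration:

/* static: each thread gets one large contiguous chunk up front,
   with essentially no bookkeeping at run time */
#pragma omp parallel for schedule(static)
for (i = 0; i < M; i++)
  tmp = gsl_ran_gamma_mt(rng, 4, 1./3);

/* dynamic: chunks (default size 1) are handed out at run time;
   with 100 million tiny iterations this bookkeeping dominates */
#pragma omp parallel for schedule(dynamic)
for (i = 0; i < M; i++)
  tmp = gsl_ran_gamma_mt(rng, 4, 1./3);

/* if dynamic scheduling were really needed, a larger chunk size
   would amortize the per-chunk overhead */
#pragma omp parallel for schedule(dynamic, 100000)
for (i = 0; i < M; i++)
  tmp = gsl_ran_gamma_mt(rng, 4, 1./3);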
