
I'm new here and a beginner-level C programmer. I'm having some trouble using OpenMP to speed up a for-loop. Below is a simple example:

#include <stdlib.h>
#include <stdio.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>   /* for gsl_ran_gamma_mt */
#include <omp.h>

gsl_rng *rng;

int main(void)
{
  int i, M = 100000000;
  double tmp;

  /* initialize RNG */
  gsl_rng_env_setup();
  rng = gsl_rng_alloc(gsl_rng_taus);
  gsl_rng_set(rng, (unsigned long int)791526599);

  // option 1: parallel
  #pragma omp parallel for default(shared) private(i, tmp) schedule(dynamic)
  for (i = 0; i < M; i++) {
    tmp = gsl_ran_gamma_mt(rng, 4, 1./3);
  }

  // option 2: sequential
  for (i = 0; i < M; i++) {
    tmp = gsl_ran_gamma_mt(rng, 4, 1./3);
  }

  gsl_rng_free(rng);
  return 0;
}

The code draws from a gamma distribution for M iterations. It turns out the parallel approach with OpenMP (option 1) takes about 1 minute, while the sequential approach (option 2) takes only 20 seconds. While running with OpenMP, I can see a CPU usage of 800% (the server I'm using has 8 CPUs). The system is Linux with GCC 4.1.3, and the compile command I'm using is gcc -fopenmp -lgsl -lgslcblas -lm (I'm using GSL).

Am I doing something wrong? Please help me! Thanks!

P.S. As pointed out by some users, it might be caused by the rng. But even if I replace

tmp=gsl_ran_gamma_mt(rng, 4, 1./3 );

by, say,

tmp=1000*10000;

the problem is still there...

  • You should not make your loop variable i private - OpenMP will take care of that. I don't know if this affects the execution, but you should fix it and retest. Commented Aug 23, 2012 at 16:42
  • Also, note that tmp=1000*10000 probably gets optimized away by the compiler to a noop, so that will skew your timing (see the sketch after these comments). Commented Aug 23, 2012 at 16:43
  • Are you sure there are actually 8 CPUs? Could it be a quad-core with hyperthreading? Commented Aug 23, 2012 at 16:54
  • I'm not sure whether there are actually 8 CPUs; you could be right. I just typed cat /proc/cpuinfo and it shows 8. Commented Aug 23, 2012 at 18:50
  • I found out that if I get rid of schedule(dynamic), the problem is gone! I wonder why? Commented Aug 23, 2012 at 18:57
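On the dead-code point above: a minimal sketch of one way to keep the compiler from discarding the loop body when benchmarking, by accumulating into a value that is actually used. This is only an illustration in the context of the question's code (it reuses i, M, and the stdio.h include from there); the reduction clause makes the parallel sum correct:

/* consume the result of every iteration so the loop
   body cannot be optimized away as dead code */
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < M; i++)
  sum += i * 0.5;
printf("sum = %f\n", sum);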

3 Answers


gsl_ran_gamma_mt probably locks on rng to prevent concurrency issues (if it didn’t, your parallel code probably contains a race condition and thus yields wrong results). The solution then would be to have a separate rng instance for each thread, thus avoiding locking.
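As an illustration, here is a minimal sketch of that idea under the same assumptions as the question (GSL, OpenMP, the Tausworthe generator). The array name rngs and the seeding scheme (base seed plus thread number) are just simple choices for the sketch, not a claim about statistical quality of the streams:

#include <stdlib.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <omp.h>

int main(void)
{
  int i, M = 100000000;
  int nthreads = omp_get_max_threads();

  /* one independent generator per thread */
  gsl_rng **rngs = malloc(nthreads * sizeof(gsl_rng *));
  for (i = 0; i < nthreads; i++) {
    rngs[i] = gsl_rng_alloc(gsl_rng_taus);
    gsl_rng_set(rngs[i], 791526599 + i);  /* distinct seed per thread */
  }

  #pragma omp parallel for schedule(static)
  for (i = 0; i < M; i++) {
    double tmp = gsl_ran_gamma_mt(rngs[omp_get_thread_num()], 4, 1./3);
    (void)tmp;  /* silence unused-variable warnings */
  }

  for (i = 0; i < nthreads; i++)
    gsl_rng_free(rngs[i]);
  free(rngs);
  return 0;
}

Each thread indexes the array with its own thread number, so no two threads ever touch the same generator state.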


4 Comments

  • Thanks a lot! But the problem still remains if I replace tmp=gsl_ran_gamma_mt(rng, 4, 1./3 ) with, say, tmp=1000*10000.
  • @user1620200 Odd, there should be a drastic drop in run time. Of course, there's still a huge overhead for parallel computation involved, since the actual computation is quite fast even with a lot of iterations. The dynamic schedule doesn't help (try a static one; there's no reason for a dynamic schedule here, since the tasks all have the same size). Finally, try Stephane's advice, but I'd be surprised if that were the issue, since this should be trivially optimised.
  • Thanks. I found out that if I get rid of schedule(dynamic), the problem is gone! I wonder why?
  • @user1620200 Because it's slooow. A dynamic schedule introduces quite a lot of overhead on its own, because OpenMP does some heavy lifting behind the scenes to distribute work to the cores. It's only suitable if the individual computations take drastically different amounts of time.

Your rng variable is shared, so the threads are spending all their time waiting to be able to use the random number generator. Give each thread a separate instance of the RNG. This will probably mean making the RNG initialization code run in parallel as well.
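For example, here is a sketch of one way to do that, allocating each generator inside the parallel region so it is a thread-private local and initialization itself runs in parallel (my_rng is a made-up name, and the seed offset is again just one simple choice):

#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <omp.h>

int main(void)
{
  const int M = 100000000;

  #pragma omp parallel
  {
    /* allocation, seeding, and freeing all happen per thread */
    gsl_rng *my_rng = gsl_rng_alloc(gsl_rng_taus);
    gsl_rng_set(my_rng, 791526599 + omp_get_thread_num());

    int i;
    #pragma omp for schedule(static)
    for (i = 0; i < M; i++) {
      double tmp = gsl_ran_gamma_mt(my_rng, 4, 1./3);
      (void)tmp;
    }

    gsl_rng_free(my_rng);
  }
  return 0;
}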

1 Comment

  • Thanks a lot! But the problem still remains if I replace tmp=gsl_ran_gamma_mt(rng, 4, 1./3 ) with, say, tmp=1000*10000;

Again thanks everyone for helping. I just found out that if I get rid of

schedule(dynamic)

in the code, the problem disappears. But why is that?

2 Comments

  • Dynamic scheduling is quite expensive and should only be used in cases where iterations take a considerable amount of time to complete, but that time can vary greatly from one iteration to the next (see the sketch after these comments). guided is even more expensive, as it makes the iteration blocks smaller and smaller. Use static scheduling for iterations with a fixed amount of compute time.
  • I think you can accept your own answer, since you were the one to solve the issue.
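To make the scheduling point concrete, here is a sketch of the question's loop with the clauses spelled out. This still shares the single rng from the question, so it only illustrates the scheduling behavior, not the per-thread-RNG fix, and the chunk size 100000 is an arbitrary value chosen for illustration:

/* static: each thread gets one large contiguous chunk up front,
   with essentially no bookkeeping at run time */
#pragma omp parallel for schedule(static)
for (i = 0; i < M; i++)
  tmp = gsl_ran_gamma_mt(rng, 4, 1./3);

/* dynamic: chunks (default size 1) are handed out at run time;
   with 100 million tiny iterations this bookkeeping dominates */
#pragma omp parallel for schedule(dynamic)
for (i = 0; i < M; i++)
  tmp = gsl_ran_gamma_mt(rng, 4, 1./3);

/* if dynamic scheduling were really needed, a larger chunk size
   would amortize the per-chunk overhead */
#pragma omp parallel for schedule(dynamic, 100000)
for (i = 0; i < M; i++)
  tmp = gsl_ran_gamma_mt(rng, 4, 1./3);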
