
I was trying to parallelize a code, but doing so only deteriorated the performance. I wrote a Fortran code that runs several Monte Carlo integrations and then finds their mean.

      implicit none  
      integer, parameter :: n=100
      integer, parameter :: m=1000000
      real, parameter :: pi=3.141592654

      real MC,integ,x,y
      integer tid,OMP_GET_THREAD_NUM,i,j,init,inside  
      read*,init
      call OMP_SET_NUM_THREADS(init)
      call random_seed()
! integ must start at zero: REDUCTION adds the per-thread sums
! to whatever value integ had on entry, which is otherwise undefined
      integ=0.
!$OMP PARALLEL DO PRIVATE(J,X,Y,INSIDE,MC)
!$OMP& REDUCTION(+:INTEG)
      do i=1,n
         inside=0
         do j=1,m
           call random_number(x)
           call random_number(y)

           x=x*pi
           y=y*2.0

           if(y.le.x*sin(x))then
             inside=inside+1
           endif

         enddo

         MC=inside*2*pi/m
         integ=integ+MC/n
      enddo
!$OMP END PARALLEL DO

      print*, integ
      end

As I increase the number of threads, the run time increases drastically. I have looked for solutions to such problems, and in most cases shared-memory variables turn out to be the culprit, but I cannot see how that affects my case.

I am running it on a 16-core processor using the Intel Fortran compiler.

EDIT: The program above is the version after adding implicit none, declaring all variables, and adding the PRIVATE clause.

  • How did you measure the time? Commented Jun 2, 2017 at 13:43
  • Please see how I changed the indentation in your code. It is good to post all your code indented so that people can see the structure. It is good for you to keep your source files that way too. Commented Jun 2, 2017 at 13:45
  • What is inside? Where does it come from? Please make sure you use IMPLICIT NONE. Commented Jun 2, 2017 at 13:47
  • @VladimirF Thank you for the edit and suggestions. The program took a few seconds to run with a single thread and more than a minute using multiple threads, so I did not use any built-in functions to measure the time. inside is the number of random points which lie under the curve x*sin(x), where x goes from 0 to pi. The total area in which the m random points are generated is 2*pi, so the area under the curve is approximately inside*(2*pi)/m, which is stored in MC. Commented Jun 3, 2017 at 13:55

1 Answer


You should not use RANDOM_NUMBER for high-performance computing, and definitely not in parallel threads. There are no guarantees about the quality of the standard random number generator, nor about its thread safety. See Can Random Number Generator of Fortran 90 be trusted for Monte Carlo Integration?

Some compilers use a fast algorithm that cannot be called in parallel. Some have a slow method that is callable from parallel regions. Some are both fast and safe to call in parallel. Some generate poor-quality random sequences, others better ones.

You should use a parallel PRNG library; there are many. See here for recommendations for Intel: https://software.intel.com/en-us/forums/intel-math-kernel-library/topic/283349 I use a library based on http://www.cmiss.org/openCMISS/wiki/RandomNumberGenerationWithOpenMP in my own slightly improved version https://bitbucket.org/LadaF/elmm/src/e732cb9bee3352877d09ae7f6b6722157a819f2c/src/simplevtk.f90?at=master&fileviewer=file-view-default but be careful: I do not care about the quality of the sequence in my applications, only about speed.


To the old version:

You have a race condition there.

With

 inside=inside+1

more threads can be competing to read and write the variable. You will have to synchronize the access somehow. If you make it a reduction, you will have problems with

integ=integ+MC/n

If you make it private, then inside=inside+1 will only count locally within each thread.

MC also appears to be in a race condition, because more threads will be writing to it. It is not clear what MC does or why it is there, because you are not using its value anywhere. Are you sure the code you show is complete? If not, please see How to make a Minimal, Complete, and Verifiable example.

See With OpenMP parallelized nested loops run slow and many other examples of how a race condition can make a program slow.


10 Comments

inside should, by default, be a private variable. I want the outer loop to run in parallel so that the inner loop runs independently in different threads. MC stores the value of the integration; it is calculated n times independently, and the mean is stored in integ. I also tried removing integ so that everything is private, just to see the time difference, but it had no impact on the run time.
Why do you think it is private by default? You have no such setting in your code. By default everything is shared. I strongly believe the program is slow because of the race condition. Did you compare the results of the serial and parallel code? Are they the same, or do they differ? I bet they will differ.
I tried adding DEFAULT(PRIVATE), but it did not have any effect (that is why I thought it must be private by default, my bad). Surprisingly, the results are the same for both the serial and the parallel code, with or without DEFAULT(PRIVATE), and the result is correct up to 3 decimal places. I also tried declaring all the variables private using the PRIVATE clause, but that did not help either.
Without a minimal reproducible example no one can help you. It must be compilable and complete.
Although it was inspired by a problem in a bigger code, I am compiling this code itself using ifort without any issues. I accept it is poorly written and that this might make it difficult to find the problem. I would also like to ask: given that every iteration took almost one second on my machine with a single thread, could it be that there is some overhead problem?
