
I've got a part of a Fortran program consisting of some nested loops that I want to parallelize with OpenMP.

integer :: nstates , N, i, dima, dimb, dimc, a_row, b_row, b_col, c_row, row, col
double complex, dimension(4,4):: mat
double complex, dimension(:), allocatable :: vecin,vecout 

nstates = 2
N = 24

allocate(vecin(nstates**N), vecout(nstates**N))
vecin = ...some data
vecout = 0

mat = reshape([...some data...],[4,4])

dimb=nstates**2

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col) 
do i=1,N-1
    dima=nstates**(i-1)
    dimc=nstates**(N-i-1)

    do a_row = 1, dima
        do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                do b_col = 1,dimb
                    col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                    !$OMP ATOMIC
                    vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
                end do
            end do
        end do
    end do
end do
!$OMP END PARALLEL DO 

The program runs and the result is correct; it's just incredibly slow, much slower than without OpenMP. I don't know much about OpenMP. Have I done something wrong with the use of PRIVATE or OMP ATOMIC? I would be grateful for any advice on how to improve the performance of this code.

  • You should take a look at the reduction clause for vecout. This should speed things up ;-) Commented Apr 1, 2015 at 8:14
  • Using an atomic instruction in the innermost loop cannot be a good idea! Commented Apr 1, 2015 at 10:15
  • Thanks for your answers. Why is the atomic instruction a bad idea? I deleted the atomic instruction and used REDUCTION(+:vecout) instead, which leads to a segmentation fault. Is there something special about reduction and arrays? Commented Apr 1, 2015 at 10:37
  • The reduction works without a segmentation fault only for small arrays. Commented Apr 1, 2015 at 10:50
  • Then you need to increase the stack size: ulimit -s unlimited Commented Apr 1, 2015 at 11:06
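For reference, the REDUCTION variant discussed in the comments above would look roughly like this (a sketch based on the question's loop; with the reduction clause every thread gets its own private copy of vecout, which is why very large arrays can overflow the thread stacks unless the limits are raised, e.g. ulimit -s unlimited for the main thread and OMP_STACKSIZE for the worker threads):

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col) &
!$OMP REDUCTION(+:vecout)
do i=1,N-1
    dima=nstates**(i-1)
    dimc=nstates**(N-i-1)

    do a_row = 1, dima
        do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                do b_col = 1,dimb
                    col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                    ! no ATOMIC needed: each thread sums into its own copy of vecout
                    vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
                end do
            end do
        end do
    end do
end do
!$OMP END PARALLEL DO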

2 Answers


If your arrays are too large and you get stack overflows with automatic reduction, you can implement the reduction yourself with allocatable temporary arrays.

As Francois Jacq pointed out, you also had a race condition caused by dima and dimc, which have to be private.

double complex, dimension(:), allocatable :: tmp

!$OMP PARALLEL PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col,tmp)

allocate(tmp(size(vecout)))   ! each thread gets its own private accumulator
tmp = 0

!$OMP DO
do i=1,N-1
    dima=nstates**(i-1)
    dimc=nstates**(N-i-1)

    do a_row = 1, dima
        do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                do b_col = 1,dimb
                    col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                    tmp(row) = tmp(row) + vecin(col)*mat(b_row,b_col)
                end do
            end do
        end do
    end do
end do
!$OMP END DO

!$OMP CRITICAL
! combine the per-thread partial sums into the shared result
vecout = vecout + tmp
!$OMP END CRITICAL
!$OMP END PARALLEL

6 Comments

Thanks, that worked. But why is this method faster than using atomic?
the variables dima and dimc should be private
Yes, I recognized this before and changed it now in my question.
From my point of view, there are still race conditions because two threads could compute the same row index. See my proposed solution.
@FrancoisJacq They can have the same row index, but they write to a different array. Your solution is good, but there is a lot of synchronization on each !$omp parallel do. The final decision should come from performance measurement.
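As a side note on measuring: a minimal way to compare the variants is to wrap each one in omp_get_wtime() calls (a sketch; t_start and t_end are just illustrative names):

use omp_lib                        ! provides omp_get_wtime()
double precision :: t_start, t_end

t_start = omp_get_wtime()
! ... run one of the parallel variants here ...
t_end = omp_get_wtime()
print '(a,f10.4,a)', 'elapsed: ', t_end - t_start, ' s'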

Could you try something like:

! b_col stays as the outermost, serial loop so that no two threads
! ever compute the same row index when updating vecout
do b_col=1,dimb
   do i=1,N-1
      dima=nstates**(i-1)
      dimc=nstates**(N-i-1)
      !$OMP PARALLEL DO COLLAPSE(3) PRIVATE(row,col,a_row,b_row,c_row)
      do a_row = 1, dima
         do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
            enddo
         enddo
      enddo
      !$OMP END PARALLEL DO
   enddo
enddo

The advantage is that the parallel loop no longer causes collisions: all the row indexes are different.

5 Comments

Thanks for the answer. I will measure the time of your solution and Vladimir F's to compare them. What I can say so far is that your solution runs faster for me without COLLAPSE. Can you tell me why you put the b_col loop outside the others? The row index should be different for every thread even when the b_col loop is inside, shouldn't it?
The b_col loop is outside the parallel loop to avoid threads computing the same row index => a race condition when modifying vecout (risk of random errors). Why did you delete the COLLAPSE clause? It just indicates that the next three loops may be merged into a single parallel one.
row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row, so row does not depend on b_col. To your COLLAPSE question: first I tried with COLLAPSE and then without. Without it was faster, so why should I use it?
Because row does not depend on b_col, the same row index may be obtained b_col times. About COLLAPSE: if you have measured with and without, then all is OK, the judge is always the measurement!
In my previous message, read "dimb times" instead of "b_col times".
