
I've got a part of a Fortran program consisting of some nested loops that I want to parallelize with OpenMP.

integer :: nstates , N, i, dima, dimb, dimc, a_row, b_row, b_col, c_row, row, col
double complex, dimension(4,4):: mat
double complex, dimension(:), allocatable :: vecin,vecout 

nstates = 2
N = 24

allocate(vecin(nstates**N), vecout(nstates**N))
vecin = ...some data
vecout = 0

mat = reshape([...some data...],[4,4])

dimb=nstates**2

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col) 
do i=1,N-1
    dima=nstates**(i-1)
    dimc=nstates**(N-i-1)

    do a_row = 1, dima
        do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                do b_col = 1,dimb
                    col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                    !$OMP ATOMIC
                    vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
                end do
            end do
        end do
    end do
end do
!$OMP END PARALLEL DO 

The program runs and the result is correct; it's just incredibly slow, much slower than without OpenMP. I don't know much about OpenMP. Have I done something wrong with the use of PRIVATE or OMP ATOMIC? I would be grateful for any advice on how to improve the performance of this code.

  • You should take a look at the reduction clause for vecout. This should speed things up ;-) Commented Apr 1, 2015 at 8:14
  • Using an atomic instruction in the innermost loop cannot be a good idea! Commented Apr 1, 2015 at 10:15
  • Thanks for your answers. Why is the atomic instruction a bad idea? I deleted the atomic instruction and used REDUCTION(+:vecout) instead, which leads to a segmentation fault. Is there something special about reduction and arrays? Commented Apr 1, 2015 at 10:37
  • The reduction works without a segmentation fault only for small arrays. Commented Apr 1, 2015 at 10:50
  • Then you need to increase the stack size: ulimit -s unlimited Commented Apr 1, 2015 at 11:06
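For reference, the REDUCTION variant discussed in the comments above would look roughly like this (a sketch based on the question's loop; with the reduction clause every thread gets its own private copy of vecout, which is why very large arrays can overflow the thread stacks unless the limits are raised, e.g. ulimit -s unlimited for the main thread and OMP_STACKSIZE for the worker threads):

!$OMP PARALLEL DO PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col) &
!$OMP REDUCTION(+:vecout)
do i=1,N-1
    dima=nstates**(i-1)
    dimc=nstates**(N-i-1)

    do a_row = 1, dima
        do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                do b_col = 1,dimb
                    col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                    ! no ATOMIC needed: each thread sums into its own copy of vecout
                    vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
                end do
            end do
        end do
    end do
end do
!$OMP END PARALLEL DO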

2 Answers


If your arrays are too large and you get stack overflows with automatic reduction, you can implement the reduction yourself with allocatable temporary arrays.

As Francois Jacq pointed out, you also had a race condition caused by dima and dimc, which have to be private.

double complex, dimension(:), allocatable :: tmp

!$OMP PARALLEL PRIVATE(dima,dimc,row,col,a_row,b_row,c_row,b_col,tmp)

allocate(tmp(size(vecout)))   ! each thread gets its own private accumulator
tmp = 0

!$OMP DO
do i=1,N-1
    dima=nstates**(i-1)
    dimc=nstates**(N-i-1)

    do a_row = 1, dima
        do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                do b_col = 1,dimb
                    col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                    tmp(row) = tmp(row) + vecin(col)*mat(b_row,b_col)
                end do
            end do
        end do
    end do
end do
!$OMP END DO

!$OMP CRITICAL
! combine the per-thread partial sums into the shared result
vecout = vecout + tmp
!$OMP END CRITICAL
!$OMP END PARALLEL

6 Comments

Thanks, that worked. But why is this method faster than using atomic?
the variables dima and dimc should be private
Yes, I recognized this before and changed it now in my question.
From my point of view, there are still race conditions because two threads could compute the same row index. See my proposed solution.
@FrancoisJacq They can have the same row index, but they write to a different array. Your solution is good, but there is a lot of synchronization on each !$omp parallel do. The final decision should come from performance measurement.
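As a side note on measuring: a minimal way to compare the variants is to wrap each one in omp_get_wtime() calls (a sketch; t_start and t_end are just illustrative names):

use omp_lib                        ! provides omp_get_wtime()
double precision :: t_start, t_end

t_start = omp_get_wtime()
! ... run one of the parallel variants here ...
t_end = omp_get_wtime()
print '(a,f10.4,a)', 'elapsed: ', t_end - t_start, ' s'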

Could you try something like:

! b_col stays as the outermost, serial loop so that no two threads
! ever compute the same row index when updating vecout
do b_col=1,dimb
   do i=1,N-1
      dima=nstates**(i-1)
      dimc=nstates**(N-i-1)
      !$OMP PARALLEL DO COLLAPSE(3) PRIVATE(row,col,a_row,b_row,c_row)
      do a_row = 1, dima
         do b_row = 1,dimb
            do c_row = 1,dimc
                row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row
                col = ((a_row-1)*dimb + b_col - 1)*dimc + c_row
                vecout(row) = vecout(row) + vecin(col)*mat(b_row,b_col)
            enddo
         enddo
      enddo
      !$OMP END PARALLEL DO
   enddo
enddo

The advantage is that the parallel loop no longer causes collisions: all the row indexes are different.

5 Comments

Thanks for the answer. I will measure the time of your solution and Vladimir F's to compare them. What I can say so far is that your solution runs faster for me without COLLAPSE. Can you tell me why you put the b_col loop outside the others? The row index should be different for every thread even when the b_col loop is inside, shouldn't it?
The b_col loop is outside the parallel loop to avoid threads computing the same row index => a race condition when modifying vecout (risk of random errors). Why did you delete the COLLAPSE clause? It just indicates that the next three loops may be merged into a single parallel one.
row = ((a_row-1)*dimb + b_row - 1)*dimc + c_row, so row does not depend on b_col. To your COLLAPSE question: first I tried with COLLAPSE and then without. Without it was faster, so why should I use it?
Because row does not depend on b_col, the same row index may be obtained b_col times. About COLLAPSE: if you have measured with and without, then all is OK, the judge is always the measurement!
In my previous message, read "dimb times" instead of "b_col times".
