
Consider the following code:

#include <stdlib.h>

float f(float M[32][32]) {
  int s = 1, i, j, n = 32;
  long counter;                            /* long: the bound 2^(n-1) overflows int for n = 32 (assumes 64-bit long) */
  float *delta = malloc(n * sizeof *delta);
  float *v = malloc(n * sizeof *v);
  float prod, p = 0;
  for (i = 0; i < n; i++) {
    v[i] = 0;
    delta[i] = 1;
  }
  for (counter = 1; counter < (1L << (n - 1)); counter++) {
    prod = 1.f;
    j = __builtin_ctzl(counter);           /* index of the lowest set bit: the ruler sequence */
    for (i = 0; i < n; i++) {
      v[i] -= 2.f * delta[j] * M[j][i];    /* incremental update of the running row sums */
      prod *= v[i];
    }
    delta[j] = -delta[j];
    s = -s;
    p += s * prod;
  }
  free(delta);
  free(v);
  return p;
}

I would like to parallelize the outer loop over counter using OpenMP.

It seems this ought not to be too hard, but I am not really sure where to start. j = __builtin_ctzl(counter) just gives the ruler sequence. It is tempting to use #pragma omp parallel for, but the way j is calculated seems to cause a problem.
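
Concretely, the naive version I have in mind is something like this, which I expect is incorrect because v, delta, and s are shared across iterations:

  #pragma omp parallel for private(i, j, prod) reduction(+:p)
  for (counter = 1; counter < (1L << (n - 1)); counter++) {
    prod = 1.f;
    j = __builtin_ctzl(counter);
    for (i = 0; i < n; i++) {
      v[i] -= 2.f * delta[j] * M[j][i];    /* data race: v and delta are shared */
      prod *= v[i];
    }
    delta[j] = -delta[j];                  /* data race */
    s = -s;                                /* loop-carried dependence */
    p += s * prod;
  }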

How can one do this?

  • Have you seen this? Commented Feb 1, 2017 at 14:30
  • What's the typical n in your case? Please provide a minimal reproducible example such that one can test if a parallel version provides a speedup and runs correctly. Commented Feb 1, 2017 at 14:43
  • @Zulan n is typically in the range 32 to 36. I put 32 in the example code. In order to give a version we can test for speed I just need to produce sample input data, right? I can't do that right now but I will as soon as possible. Commented Feb 1, 2017 at 14:53
  • @ryyker Thank you. There are slight subtleties here in s = -s; p += s*prod; but maybe it can be done straightforwardly? Commented Feb 1, 2017 at 14:55

2 Answers


It seems that this ought not to be too hard but I am not really sure where to start.

If it seems that it ought not to be hard then you have not properly appreciated the task. Not all high-iteration-count loops are good candidates for parallel computation.

j = __builtin_ctzl(counter) is just giving the ruler sequence. It is tempting to use #pragma omp parallel for but the way j is calculated seems to cause a problem.

It's not so much the way j is calculated that presents a problem, but how it is used and the values it takes over the course of the serial computation. Note in particular that

  • each loop iteration uses the value of delta[j]
  • each loop iteration also modifies the value of delta[j]
  • j takes the same values repeatedly, many times over the course of the overall computation

Thus, your use of delta and j creates many dependencies between different iterations of your loop. Iterations cannot be reordered or run in parallel unless relative order is preserved for iterations that use the same value of j -- otherwise, the result of the computation may be altered.

And that's not all. Each iteration of your inner loop both uses and modifies s and all the elements of array v. These are more serious problems for parallelization, because they make every iteration of the outer loop dependent on all previous iterations.
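
To make the v dependence concrete, here is a trace of the first few iterations of the code in the question:

  counter = 1:  j = 0;  v[i] -= 2*delta[0]*M[0][i]   (first update; delta[0] then flips to -1)
  counter = 2:  j = 1;  v[i] -= 2*delta[1]*M[1][i]   (operates on the v left behind by iteration 1)
  counter = 3:  j = 0;  v[i] -= 2*delta[0]*M[0][i]   (uses the delta[0] that iteration 1 negated)

Each iteration's prod is the product over a v that has accumulated every previous update, so no iteration can begin until all of its predecessors have finished.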

How can one do this?

You need to deal with the dependency problems before this computation can make use of parallelism. If the problem were just with delta and j, then perhaps you could rearrange the computation to group together all of the work involving each j value. If it were just with s, then it's very likely that you could split the computation into s == 1 and s == -1 blocks and then parallelize away (a sketch of that idea follows below). I'm having trouble, however, seeing how you could deal with both of those simultaneously, or how you could deal with the dependency involving v at all.
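
The sketch: since s is flipped once per iteration starting from 1, iteration counter effectively uses s = (counter % 2 == 0) ? 1 : -1, so the terms could be split into two reductions. Note that this assumes a hypothetical function prod_for() that could compute each iteration's product independently, which is exactly what the v dependency rules out:

  float p_pos = 0.f, p_neg = 0.f;
  #pragma omp parallel for reduction(+:p_pos, p_neg)
  for (long counter = 1; counter < (1L << (n - 1)); counter++) {
    float prod = prod_for(counter);  /* hypothetical: not actually computable independently here */
    if (counter % 2 == 0)
      p_pos += prod;                 /* s == +1 on even counters */
    else
      p_neg += prod;                 /* s == -1 on odd counters */
  }
  p = p_pos - p_neg;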

Perhaps you have some insight into the nature of the computation that will enable you to reimplement it in a form more suitable for parallel computation. The form it's in now, however, is inherently serial.


1 Comment

Thank you for this. I will improve the question in a few days and hopefully some of these issues will become clearer.

Parallelization is a good idea when the values computed inside a loop depend only on the loop counter and on variables that do not change across iterations.

Parallelization is a bad idea when values computed in one iteration depend on values computed in previous iterations.

A simple exercise to check if a serial loop is suitable for parallelization is to try reversing the order of the loop. If the counter is in ascending order, try it in descending order. If the loop produces the same results, parallelization is well-suited to your problem.
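
Applied to the code in the question, the test looks like this (a sketch; here it will generally produce a different p, because v and delta evolve along a completely different path when counter runs backwards):

  for (counter = (1L << (n - 1)) - 1; counter >= 1; counter--) {
    prod = 1.f;
    j = __builtin_ctzl(counter);
    for (i = 0; i < n; i++) {
      v[i] -= 2.f * delta[j] * M[j][i];
      prod *= v[i];
    }
    delta[j] = -delta[j];
    s = -s;
    p += s * prod;
  }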

In your problem, many of the variables are dependent on their values from previous iterations: v, delta, s, and p.

Since only the final value of p matters, p itself is not a problem: an omp parallel for with a reduction clause can handle it.
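
For reference, this is what a reduction looks like on a loop whose iterations really are independent (a toy sum, not your computation):

  float total = 0.f;
  long k;
  #pragma omp parallel for reduction(+:total)
  for (k = 1; k < (1L << 20); k++) {
    /* each thread accumulates into its own private copy of total;
       OpenMP combines the private copies when the loop ends */
    total += 1.f / k;
  }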

s can be redefined so that it does not depend on its previous value: if (counter % 2 == 0) { s = 1; } else { s = -1; }

However, I don't see a good way to handle delta and v. This requires more familiarity with the specifics of this problem than I have. If you can find a way to define these quantities in a non-recursive manner, then OpenMP will become very useful. Good luck!

