It seems that this ought not to be too hard but I am not really sure where to start.
If it seems that it ought not to be hard then you have not properly appreciated the task. Not all high-iteration-count loops are good candidates for parallel computation.
j = __builtin_ctzl(counter) is just giving the ruler sequence. It is tempting to use #pragma omp parallel for but the way j is calculated seems to cause a problem.
It's not so much the way j is calculated, but how it is used and the values it takes over the course of the serial computation that presents a problem. Note in particular that
- each loop iteration uses the value of
delta[j]
- each loop iteration also modifies the value of
delta[j]
j takes the same values repeatedly, many times over the course of the overall computation
Thus, your use of delta and j creates many dependencies between different iterations of your loop. Iterations cannot be reordered or run in parallel unless relative order is preserved for iterations that use the same value of j -- otherwise, the result of the computation may be altered.
And that's not all. Each iteration of your inner loop both uses and modifies s and all the elements of array v. These are more serious problems for parallelization, because they make every iteration of the outer loop dependent on all previous iterations.
How can one do this?
You need to deal with the dependency problems before this computation can make use of parallelism. If the problem were just with delta and j then perhaps you could rearrange the computation to handle all computations involving each j value together. If it were just with s then it's very likely that you could split the computation into s == 1 and s == -1 blocks, and then parallelize away. I'm having trouble, however, seeing how you could deal with both of those simultaneously, or how you could deal with the dependency involving v at all.
Perhaps you have some insight into the nature of the computation that will enable you reimplement it in a form more suitable for parallel computation. The form it's in now is inherently serial, however.
nin your case? Please provide a minimal reproducible example such that one can test if a parallel version provides a speedup and runs correctly.s = -s; p += s*prod;but maybe it can be done straightforwardly?