gcc auto-vectorisation (unhandled data-ref)

Question

I do not understand why the following code is not vectorized by gcc 4.4.6:

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + pfTab[iIndex];
}

 note: not vectorized: unhandled data-ref

However, if I write the following code

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}

gcc succeeds in auto-vectorizing this loop.

If I add an OpenMP directive

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  #pragma omp parallel for
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}

I get the following note: not vectorized: unhandled data-ref

Could you please help me understand why the first and the third versions are not auto-vectorized?

Second question: math functions (exp, log, etc.) do not seem to be vectorized. This code, for example,

for (int i = 0; i < iSize; i++)
         pfResult[i] = exp(pfResult[i]);

is not vectorized. Is this due to my version of gcc?

Edit: with the newer gcc 4.8.1 and OpenMP 2011 (echo |cpp -fopenmp -dM |grep -i open), I get the following notes for every kind of loop, even one as basic as

   for (iGID = 0; iGID < iSize; iGID++)
        {
             pfResult[iGID] = fValue;
        }


note: not consecutive access *_144 = 5.0e-1;
note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.

Edit2:

#include <stdio.h>
#include <sys/time.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>
#include <unistd.h>   /* getpagesize() */
#include <omp.h>

int main()
{
        int szGlobalWorkSize = 131072;
        int iGID = 0;
        int j = 0;
        omp_set_dynamic(0);
        // warmup
        #if WARMUP
        #pragma omp parallel
        {
        #pragma omp master
        {
        printf("%d threads\n", omp_get_num_threads());
        }
        }
        #endif
        printf("Pagesize=%d\n", getpagesize());
        float *pfResult = (float *)malloc(szGlobalWorkSize * 100* sizeof(float));
        float fValue = 0.5f;
        struct timeval tim;
        gettimeofday(&tim, NULL);
        double tLaunch1=tim.tv_sec+(tim.tv_usec/1000000.0);
        double time = omp_get_wtime();
        int iChunk = getpagesize();
        int iSize = ((int)szGlobalWorkSize * 100) / iChunk;
        //#pragma omp parallel for
        for (iGID = 0; iGID < iSize; iGID++)
        {
             pfResult[iGID] = fValue;
        }
        time = omp_get_wtime() - time;
        gettimeofday(&tim, NULL);
        double tLaunch2=tim.tv_sec+(tim.tv_usec/1000000.0);
        printf("%.6lf Time1\n", tLaunch2-tLaunch1);
        printf("%.6lf Time2\n", time);
}

Result with

#define _OPENMP 201107
gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-15)

gcc -march=native -fopenmp -O3 -ftree-vectorizer-verbose=2 test.c -lm

This prints many of the following notes:

note: Failed to SLP the basic block.
note: not vectorized: failed to find SLP opportunities in basic block.
note: not consecutive access *_144 = 5.0e-1;

Thanks

The first thing would indeed be to try a more recent version of gcc. Then realize that without restrict, vectorization could be wrong. And add -ffast-math because the compiler is scared otherwise. For exp and log, I'm sure I've seen related questions on SO. Basically, you would need a library that provides vector versions of exp and log so that gcc could generate calls to them.
– Marc Glisse, Commented Nov 20, 2014 at 12:31
Scratch my previous comment, why aren't you using i in your loops???
– Marc Glisse, Commented Nov 20, 2014 at 12:38
Thanks a lot. I have already tried with 'restrict' and const and the result is the same. I will try a more recent version of gcc. Sorry for the typo in the loop.
– parisjohn, Commented Nov 20, 2014 at 13:14
I have installed gcc 4.8.1 and now all my loops give the following notes: note: Failed to SLP the basic block. note: not vectorized: failed to find SLP opportunities in basic block.
– parisjohn, Commented Nov 20, 2014 at 13:46
Your code doesn't compile (missing headers?), that's just rude. There are online compilers available if you want to test your code without installing anything.
– Marc Glisse, Commented Nov 20, 2014 at 16:26

Hristo Iliev · Accepted Answer · 2014-11-20 18:29:51Z

GCC cannot vectorise the first version of your loop because it cannot prove that pfTab[iIndex] does not lie somewhere within the memory spanned by pfResult[0] ... pfResult[iSize-1] (pointer aliasing). Indeed, if pfTab[iIndex] lies within that memory, its value is overwritten by the assignment in the loop body and the new value must be used in the iterations that follow. Use the restrict keyword to tell the compiler that this can never happen, and it should then happily vectorise your code:

$ cat foo.c
int MyFunc(const float *restrict pfTab, float *restrict pfResult,
           int iSize, int iIndex)
{
   for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + pfTab[iIndex];
}
$ gcc -v
...
gcc version 4.6.1 (GCC)
$ gcc -std=c99 -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:3: note: LOOP VECTORIZED.
foo.c:1: note: vectorized 1 loops in function.
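
To make the aliasing hazard concrete, here is a minimal sketch. The function names add_reload and add_hoisted are made up for this illustration and do not appear in the question. The first function has the same body as the original loop, the second one reads pfTab[iIndex] once before the loop:

```c
/* Same body as the first version of the loop: pfTab[iIndex] is read again
   on every iteration, so a store to pfResult[i] may change the value. */
void add_reload(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + pfTab[iIndex];
}

/* Same body as the second version: the value is read exactly once,
   before any element of pfResult is modified. */
void add_hoisted(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
    float fTab = pfTab[iIndex];
    for (int i = 0; i < iSize; i++)
        pfResult[i] = pfResult[i] + fTab;
}
```

Called with pfTab and pfResult both pointing to the same four-element array {1, 1, 1, 1} and iIndex = 1, add_reload leaves {2, 2, 3, 3} in the array while add_hoisted leaves {2, 2, 2, 2}. Since the results differ, the compiler is not allowed to turn the first loop into the second on its own; restrict is your promise that such a call never happens.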

The second version vectorises because the value is copied into a variable with automatic storage duration whose address is never taken. No valid pointer can refer to such a variable, so the compiler may assume that stores through pfResult never modify fTab.

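The OpenMP version does not vectorise because of the way OpenMP is implemented in GCC. It uses code outlining for the parallel regions: the body of the region is moved into a separate function that receives the shared variables through a structure.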

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  #pragma omp parallel for
  for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}

effectively becomes:

struct omp_data_s
{
  float *pfResult;
  int iSize;
  float fTab;   // shared scalar, copied in before and back out after the region
};

// Outlined body of the parallel region (GCC itself names it MyFunc._omp_fn.0)
void MyFunc_omp_fn0 (struct omp_data_s *omp_data_i);

int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
  float fTab =  pfTab[iIndex];
  struct omp_data_s omp_data_o;

  // Pack the shared variables into the data block
  omp_data_o.pfResult = pfResult;
  omp_data_o.iSize = iSize;
  omp_data_o.fTab = fTab;

  GOMP_parallel_start (MyFunc_omp_fn0, &omp_data_o, 0);
  MyFunc_omp_fn0 (&omp_data_o);   // the calling thread runs the body as well
  GOMP_parallel_end ();
  // Copy the shared variables back
  pfResult = omp_data_o.pfResult;
  iSize = omp_data_o.iSize;
  fTab = omp_data_o.fTab;
}

void MyFunc_omp_fn0 (struct omp_data_s *omp_data_i)
{
  int start = ...; // compute starting iteration for current thread
  int end = ...; // compute ending iteration for current thread

  for (int i = start; i < end; i++)
    omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}

MyFunc_omp_fn0 contains the outlined function code. The compiler is not able to prove that omp_data_i->pfResult does not point to memory that aliases omp_data_i and specifically its member fTab.

In order to vectorise that loop, you have to make fTab firstprivate. This will turn it into an automatic variable in the outlined code and that will be equivalent to your second case:

$ cat foo.c
int MyFunc(const float *pfTab, float *pfResult, int iSize, int iIndex)
{
   float fTab = pfTab[iIndex];
   #pragma omp parallel for firstprivate(fTab)
   for (int i = 0; i < iSize; i++)
     pfResult[i] = pfResult[i] + fTab;
}
$ gcc -std=c99 -fopenmp -O3 -march=native -ftree-vectorizer-verbose=2 -c foo.c
foo.c:6: note: LOOP VECTORIZED.
foo.c:4: note: vectorized 1 loops in function.
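
As a rough sketch of what firstprivate changes, the two hand-written functions below mimic the outlined loop with and without the private copy. This is not the code GCC actually generates: the names body_shared and body_firstprivate, and the start/end parameters that stand in for the per-thread iteration range, are assumptions made only for this illustration.

```c
/* Hypothetical stand-in for the data block passed to the outlined function. */
struct omp_data_s
{
    float *pfResult;
    int iSize;
    float fTab;
};

/* Shared fTab: the value is read through omp_data_i in every iteration,
   so a store to pfResult[i] could, as far as the compiler knows, change it. */
void body_shared(struct omp_data_s *omp_data_i, int start, int end)
{
    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + omp_data_i->fTab;
}

/* firstprivate(fTab): the value is copied once into an automatic variable,
   which no store through pfResult can modify. */
void body_firstprivate(struct omp_data_s *omp_data_i, int start, int end)
{
    float fTab = omp_data_i->fTab;
    for (int i = start; i < end; i++)
        omp_data_i->pfResult[i] = omp_data_i->pfResult[i] + fTab;
}
```

For any pfResult that does not overlap the structure, both functions compute the same result; only the second one gives the vectoriser a loop-invariant addend it can rely on.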
