simple loop: vectorisation with gcc

Question

On my laptop (ubuntu.14 + gcc-5.x), I have AVX:

~> tail /proc/cpuinfo 
   model name   : Intel(R) Core(TM) i7-3687U CPU @ 2.10GHz
   flags        : ... sse sse2 ... avx

I compile this very simple code:

~> more test.c 
  #include <stdio.h>
  int main(void) {
    int i=0; double a=1.;
    for(i=0;i<1000000;i++) a+=i;
    printf("%f\n", a); // printf: avoid compiler optim (dummy var suppression)
  }
~> make
   gcc -O2 -march=native -mavx -ftree-vectorize -funroll-loops -fopt-info-vec -fopt-info-loop -o test.exe test.c
   test.c:4:3: note: loop unrolled 7 times

I don't understand whether the loop has "really" been vectorised, as the message says. objdump shows:

~> objdump -S test.exe | grep add
   40041d:  48 83 c4 08             add    $0x8,%rsp
   4004a9:  c5 fb 58 c2             vaddsd %xmm2,%xmm0,%xmm0
   4004b8:  c5 fb 58 ec             vaddsd %xmm4,%xmm0,%xmm5
   4004cb:  c5 53 58 c7             vaddsd %xmm7,%xmm5,%xmm8
   4004e1:  c4 41 3b 58 da          vaddsd %xmm10,%xmm8,%xmm11
   4004ea:  83 c0 08                add    $0x8,%eax
   4004f2:  c4 41 23 58 f5          vaddsd %xmm13,%xmm11,%xmm14
   4004f7:  c5 8b 58 d1             vaddsd %xmm1,%xmm14,%xmm2
   4004fb:  c5 eb 58 e3             vaddsd %xmm3,%xmm2,%xmm4
   4004ff:  c5 db 58 c6             vaddsd %xmm6,%xmm4,%xmm0
   4005bb:  48 01 c6                add    %rax,%rsi
   40067d:  48 83 c3 01             add    $0x1,%rbx
   400686:  48 83 c4 08             add    $0x8,%rsp
   4006a8:  48 83 c4 08             add    $0x8,%rsp

So in the end I get "vaddsd" (with a "v" that seems to stand for "vectorised"), but not the "addpd" I would have expected.

My understanding is that "addsd" is scalar addition (= 1 regular addition), and that "addpd" is packed addition (= several additions performed by 1 instruction). Also, I don't understand how "vaddsd" differs from "addpd": are these supposed to be the same? (Googling this does not give relevant answers.)

Why don't I get "addpd"? Is a compile option missing? Are hints or a pragma missing in the code? Or is this the expected behaviour, and if so, why?

FH

UPDATE

The message says the loop has been vectorized, but it has not: I get no speed-up:

~> more test.c 
   #include <stdio.h>
   int main(void) {
     unsigned int i=0; double a=1.;
     for(i=0;i<3000000000;i++) a+=i;
     printf("%f\n", a); // printf: prevents the compiler from optimising away a as an unused variable
   }
~> make
   gcc -O2 -march=native -o test.novec.exe test.c
   gcc -O2 -march=native -mavx -ftree-vectorize -funroll-loops -fopt-info-vec -fopt-info-loop -o test.vec.exe test.c
   test.c:4:3: note: loop unrolled 7 times
~> time ./test.novec.exe 
   4499999997067113984.000000
   real 0m2.927s
   user 0m2.928s
   sys  0m0.000s
~> time ./test.vec.exe
   4499999997067113984.000000
   real 0m2.926s
   user 0m2.924s
   sys  0m0.000s

... unless I add -ffast-math (or -Ofast, which includes -ffast-math):

~> make
   gcc -O2 -march=native -mavx -ftree-vectorize -funroll-loops -fopt-info-vec -fopt-info-loop -ffast-math -o test.vec.fm.exe test.c
   test.c:4:3: note: loop vectorized
   test.c:4:30: note: loop unrolled 3 times
~> time ./test.vec.fm.exe 
   4499999999597346816.000000
   real 0m1.980s
   user 0m1.980s
   sys  0m0.000s

The transformation you are hoping for reorders a sum of floating-point numbers. To be safe, gcc only does it if you tell it that it is OK (-ffast-math, for instance).
– Marc Glisse, Commented Mar 8, 2016 at 14:49
Compile with -Ofast and it will vectorize (you will see vaddpd).
– Z boson, Commented Mar 8, 2016 at 19:12
Yes, I have described this many times. You are doing a reduction. To use SIMD for a reduction, the operation needs to be associative. Floating-point arithmetic is not associative, but -ffast-math (which you get with -Ofast) tells the compiler to assume it is. This means your results with and without associative math may differ (though not necessarily be less accurate). Note that ICC defaults to associative floating-point math, so ICC will vectorize this with only -O3, but Clang, GCC, and MSVC don't assume associative math by default.
– Z boson, Commented Mar 9, 2016 at 10:04
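
The reordering described in these comments can be made explicit. The sketch below is an illustration, not what GCC literally emits: `sum_sequential` is the loop from the question, `sum_four_lanes` is a hypothetical hand-written version that keeps four independent partial sums (roughly what a 4-lane vectorized reduction does), and `sum_omp_simd` uses `#pragma omp simd reduction(+:a)` to permit reordering for that one loop. The pragma is assumed to be enabled with GCC's -fopenmp-simd; without that flag it is simply ignored and the loop stays scalar.

```c
/* Strict order: a single dependency chain, exactly as the source is written. */
static double sum_sequential(unsigned int n)
{
    double a = 1.0;
    for (unsigned int i = 0; i < n; i++)
        a += i;
    return a;
}

/* Hand-reordered: four independent partial sums, combined at the end.
 * Hypothetical helper for illustration; assumes n <= UINT_MAX - 4 so that
 * i + 4 cannot wrap around. */
static double sum_four_lanes(unsigned int n)
{
    double acc[4] = {1.0, 0.0, 0.0, 0.0};
    unsigned int i = 0;
    for (; i + 4 <= n; i += 4) {
        acc[0] += i;
        acc[1] += i + 1;
        acc[2] += i + 2;
        acc[3] += i + 3;
    }
    for (; i < n; i++)          /* leftover iterations */
        acc[0] += i;
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

/* Same loop as sum_sequential, with per-loop permission to reorder the sum.
 * The pragma only takes effect when OpenMP SIMD support is enabled
 * (assumed here: -fopenmp-simd). */
static double sum_omp_simd(unsigned int n)
{
    double a = 1.0;
#pragma omp simd reduction(+:a)
    for (unsigned int i = 0; i < n; i++)
        a += i;
    return a;
}
```

For small n (say 1000) every intermediate value is an integer below 2^53, so all three functions return exactly the same result. For n = 3000000000, as in the question, the running sum exceeds 2^53, each addition rounds, and the rounding depends on the order of the additions. That is why the -ffast-math build prints a slightly different total.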

Mike Frysinger · Accepted Answer · 2016-03-08 14:53:13Z

addsd and addpd are the legacy SSE2 SIMD instructions. vaddsd and vaddpd are the newer AVX SIMD instructions. This page seems to provide a good comparison between the two: it's a more flexible encoding with higher precision.


1 Comment

Peter Cordes Over a year ago

There's no difference in precision between AVX and SSE; each of the double-precision FP addition operations produces the correctly-rounded result, as required for IEEE basic operations (+ - * / and sqrt). AVX lets you use wider vectors to process more elements with one instruction, but vaddsd is still just scalar.
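
The scalar/packed distinction in this comment can be sketched in C. This is a minimal illustration that assumes the GCC/Clang `vector_size` extension (not standard C); `v4df`, `add_scalar` and `add_packed4` are names made up for the example. On x86-64 with -mavx, a compiler will typically turn the vector addition into a single vaddpd on a ymm register, while the scalar function becomes one vaddsd.

```c
#include <string.h>

/* GCC/Clang vector extension (not standard C): four doubles, 32 bytes. */
typedef double v4df __attribute__((vector_size(32)));

/* One addition per operation: the scalar form (addsd / vaddsd). */
static double add_scalar(double x, double y)
{
    return x + y;
}

/* Four additions in one operation: the packed form (vaddpd with AVX).
 * memcpy moves the data in and out of the vector type without any
 * alignment requirement on the caller's arrays. */
static void add_packed4(const double x[4], const double y[4], double out[4])
{
    v4df vx, vy, vr;
    memcpy(&vx, x, sizeof vx);
    memcpy(&vy, y, sizeof vy);
    vr = vx + vy;               /* element-wise add of all four lanes */
    memcpy(out, &vr, sizeof vr);
}
```

Each lane of the packed result is the same correctly rounded double that the scalar addition produces. Only the number of elements processed per instruction changes.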
