On my laptop (ubuntu.14 + gcc-5.x), I have AVX:

~> tail /proc/cpuinfo 
   model name   : Intel(R) Core(TM) i7-3687U CPU @ 2.10GHz
   flags        : ... sse sse2 ... avx

I compile this very simple code:

~> more test.c 
  #include <stdio.h>
  int main(void) {
    int i=0; double a=1.;
    for(i=0;i<1000000;i++) a+=i;
    printf("%f\n", a); // printf: avoid compiler optim (dummy var suppression)
  }
~> make
   gcc -O2 -march=native -mavx -ftree-vectorize -funroll-loops -fopt-info-vec -fopt-info-loop -o test.exe test.c
   test.c:4:3: note: loop unrolled 7 times

I don't understand whether the loop has "really" been vectorised, as the message seems to say. objdump shows:

~> objdump -S test.exe | grep add
   40041d:  48 83 c4 08             add    $0x8,%rsp
   4004a9:  c5 fb 58 c2             vaddsd %xmm2,%xmm0,%xmm0
   4004b8:  c5 fb 58 ec             vaddsd %xmm4,%xmm0,%xmm5
   4004cb:  c5 53 58 c7             vaddsd %xmm7,%xmm5,%xmm8
   4004e1:  c4 41 3b 58 da          vaddsd %xmm10,%xmm8,%xmm11
   4004ea:  83 c0 08                add    $0x8,%eax
   4004f2:  c4 41 23 58 f5          vaddsd %xmm13,%xmm11,%xmm14
   4004f7:  c5 8b 58 d1             vaddsd %xmm1,%xmm14,%xmm2
   4004fb:  c5 eb 58 e3             vaddsd %xmm3,%xmm2,%xmm4
   4004ff:  c5 db 58 c6             vaddsd %xmm6,%xmm4,%xmm0
   4005bb:  48 01 c6                add    %rax,%rsi
   40067d:  48 83 c3 01             add    $0x1,%rbx
   400686:  48 83 c4 08             add    $0x8,%rsp
   4006a8:  48 83 c4 08             add    $0x8,%rsp

So in the end I get "vaddsd" (with a "v" that seems to stand for "vectorised"), but I do not get the "addpd" I would have expected.

My understanding is that "addsd" is scalar addition (= 1 regular addition), and that "addpd" is packed addition (= several additions vectorised in 1 cycle). Also, I don't understand how "vaddsd" differs from "addpd": are these supposed to be the same? (Googling this does not give relevant answers.)

Why don't I get "addpd"? Is a compile option missing? Are hints or pragmas missing in the code? Or is this the expected behaviour, and if so, why?

FH

UPDATE

The message says it has been vectorized, but it is not; I get no speed-up:

~> more test.c 
   #include <stdio.h>
   int main(void) {
     unsigned int i=0; double a=1.;
     for(i=0;i<3000000000;i++) a+=i;
     printf("%f\n", a); // printf: keeps the compiler from optimising a away as a dummy variable
   }
~> make
   gcc -O2 -march=native -o test.novec.exe test.c
   gcc -O2 -march=native -mavx -ftree-vectorize -funroll-loops -fopt-info-vec -fopt-info-loop -o test.vec.exe test.c
   test.c:4:3: note: loop unrolled 7 times
~> time ./test.novec.exe 
   4499999997067113984.000000
   real 0m2.927s
   user 0m2.928s
   sys  0m0.000s
~> time ./test.vec.exe
   4499999997067113984.000000
   real 0m2.926s
   user 0m2.924s
   sys  0m0.000s

... unless I add -ffast-math (or -Ofast, which includes -ffast-math):

~> make
   gcc -O2 -march=native -mavx -ftree-vectorize -funroll-loops -fopt-info-vec -fopt-info-loop -ffast-math -o test.vec.fm.exe test.c
   test.c:4:3: note: loop vectorized
   test.c:4:30: note: loop unrolled 3 times
~> time ./test.vec.fm.exe 
   4499999999597346816.000000
   real 0m1.980s
   user 0m1.980s
   sys  0m0.000s
  • The message nowhere says it has been vectorized... Commented Mar 8, 2016 at 14:45
  • The transformation you are hoping for reorders a sum of floating point numbers. To be safe, gcc only does it if you tell it it is ok (-ffast-math for instance). Commented Mar 8, 2016 at 14:49
  • Compile with -Ofast and it will vectorize (you will see vaddpd). Commented Mar 8, 2016 at 19:12
  • seems you only get speed-up IF you add -ffast-math option Commented Mar 9, 2016 at 9:20
  • Yes, I have described this many times. You are doing a reduction. In order to use SIMD with reductions, the operations need to be associative. Floating-point arithmetic is not associative, but -ffast-math (which you get with -Ofast) tells the compiler to assume it is. This means your results with and without associative math may differ (though they are not necessarily less accurate). Note that ICC defaults to associative floating-point math, so ICC will vectorize this with only -O3, but Clang, GCC, and MSVC don't assume associative math by default. Commented Mar 9, 2016 at 10:04
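
The reordering these comments describe can be sketched by hand. This is only an illustration under stated assumptions, not what GCC actually emits: the function names sum_sequential, sum_four_lanes and regrouping_changes_result are hypothetical, and the choice of four partial sums is an assumption made to mirror the four doubles that fit in one 256-bit AVX register. The first function keeps the source order of the additions; the second regroups them into independent partial sums, which is the kind of change a compiler may only make when told that floating-point addition can be treated as associative.

```c
/* Sequential reduction: one accumulator, additions in source order.
   Without -ffast-math the compiler has to preserve this order. */
double sum_sequential(unsigned int n) {
    double a = 1.0;
    for (unsigned int i = 0; i < n; i++)
        a += i;
    return a;
}

/* Regrouped reduction: four independent partial sums (an assumed lane
   count), combined at the end. Each lane could live in one element of
   a vector register, so one packed add would update all four. */
double sum_four_lanes(unsigned int n) {
    double lane[4] = {1.0, 0.0, 0.0, 0.0};
    unsigned int i = 0;

    /* n - i cannot underflow because i never exceeds n. */
    for (; n - i >= 4; i += 4) {
        lane[0] += i;
        lane[1] += i + 1;
        lane[2] += i + 2;
        lane[3] += i + 3;
    }
    /* Remaining 0..3 iterations. */
    for (; i < n; i++)
        lane[0] += i;

    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}

/* Returns 1 when (x + y) + z and x + (y + z) round to different doubles.
   The volatile temporaries force each intermediate to be stored as a double. */
int regrouping_changes_result(double x, double y, double z) {
    volatile double left = x + y;
    left = left + z;
    volatile double right = y + z;
    right = x + right;
    return left != right;
}
```

For small n, where every partial sum is an exactly representable integer, both sum functions agree. Once the running total gets large enough that individual additions are rounded, the two groupings can give different results, which is consistent with the -ffast-math build in the question printing a different sum.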

1 Answer

addsd and addpd are the legacy SSE2 SIMD instructions. vaddsd and vaddpd are the newer AVX SIMD instructions. This page seems to provide a good comparison between the two: it's a more flexible encoding with higher precision.

1 Comment

There's no difference in precision between AVX and SSE; each of the double-precision FP addition operations produces the correctly-rounded result, as required for IEEE basic operations (+ - * / and sqrt). AVX lets you use wider vectors to process more elements with one instruction, but vaddsd is still just scalar.
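
The scalar versus packed distinction can be tried directly with compiler intrinsics. The sketch below assumes an x86-64 target and uses the SSE2 intrinsics from emmintrin.h; the wrapper names add_scalar_low and add_packed are hypothetical. _mm_add_sd adds only the low element and copies the upper element from its first operand, while _mm_add_pd adds both elements. Built without -mavx these should compile to addsd and addpd; built with -mavx, GCC should emit the VEX-encoded vaddsd and vaddpd for the same source, which illustrates that the "v" prefix names the encoding and does not by itself mean the operation is packed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, available by default on x86-64 */

/* Scalar add: out[0] = a[0] + b[0]; out[1] is copied from a[1]. */
void add_scalar_low(const double a[2], const double b[2], double out[2]) {
    __m128d va = _mm_loadu_pd(a);            /* a[0] in the low lane, a[1] in the high lane */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_sd(va, vb));  /* (v)addsd */
}

/* Packed add: both lanes are added by a single instruction. */
void add_packed(const double a[2], const double b[2], double out[2]) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));  /* (v)addpd */
}
```

With a = {1.0, 2.0} and b = {10.0, 20.0}, add_scalar_low leaves the second element at 2.0, whereas add_packed produces 22.0 there; the first element is 11.0 in both cases.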
