On my laptop (ubuntu.14 + gcc-5.x), I have AVX:
~> tail /proc/cpuinfo
model name : Intel(R) Core(TM) i7-3687U CPU @ 2.10GHz
flags : ... sse sse2 ... avx
I compile this very simple code:
~> more test.c
#include <stdio.h>
int main(void) {
  int i=0; double a=1.;
  for(i=0;i<1000000;i++) a+=i;
  printf("%f\n", a); // printf: keeps the compiler from optimising away the otherwise unused variable a
}
~> make
gcc -O2 -march=native -mavx -ftree-vectorize -funroll-loops -fopt-info-vec -fopt-info-loop -o test.exe test.c
test.c:4:3: note: loop unrolled 7 times
I don't understand whether the loop has "really" been vectorised, as the message suggests. objdump shows:
~> objdump -S test.exe | grep add
40041d: 48 83 c4 08 add $0x8,%rsp
4004a9: c5 fb 58 c2 vaddsd %xmm2,%xmm0,%xmm0
4004b8: c5 fb 58 ec vaddsd %xmm4,%xmm0,%xmm5
4004cb: c5 53 58 c7 vaddsd %xmm7,%xmm5,%xmm8
4004e1: c4 41 3b 58 da vaddsd %xmm10,%xmm8,%xmm11
4004ea: 83 c0 08 add $0x8,%eax
4004f2: c4 41 23 58 f5 vaddsd %xmm13,%xmm11,%xmm14
4004f7: c5 8b 58 d1 vaddsd %xmm1,%xmm14,%xmm2
4004fb: c5 eb 58 e3 vaddsd %xmm3,%xmm2,%xmm4
4004ff: c5 db 58 c6 vaddsd %xmm6,%xmm4,%xmm0
4005bb: 48 01 c6 add %rax,%rsi
40067d: 48 83 c3 01 add $0x1,%rbx
400686: 48 83 c4 08 add $0x8,%rsp
4006a8: 48 83 c4 08 add $0x8,%rsp
So in the end I get "vaddsd" (with a "v" that seems to stand for "vectorised"), but not the "addpd" I would have expected.
My understanding is that "addsd" is a scalar addition (one regular addition) and that "addpd" is a packed addition (several additions performed by a single instruction). I also don't understand how "vaddsd" differs from "addpd": are they supposed to be the same? (Googling this does not give relevant answers.)
Why don't I get "addpd"? Is a compile option missing? Are hints or pragmas missing in the code? Or is this the expected behaviour, and if so, why?
FH
UPDATE
The message suggests the loop has been vectorized, but it has not: I get no speed-up.
~> more test.c
#include <stdio.h>
int main(void) {
  unsigned int i=0; double a=1.;
  for(i=0;i<3000000000;i++) a+=i;
  printf("%f\n", a); // printf: keeps the compiler from optimising away the otherwise unused variable a
}
~> make
gcc -O2 -march=native -o test.novec.exe test.c
gcc -O2 -march=native -mavx -ftree-vectorize -funroll-loops -fopt-info-vec -fopt-info-loop -o test.vec.exe test.c
test.c:4:3: note: loop unrolled 7 times
~> time ./test.novec.exe
4499999997067113984.000000
real 0m2.927s
user 0m2.928s
sys 0m0.000s
~> time ./test.vec.exe
4499999997067113984.000000
real 0m2.926s
user 0m2.924s
sys 0m0.000s
... unless I add -ffast-math (or -Ofast, which includes -ffast-math):
~> make
gcc -O2 -march=native -mavx -ftree-vectorize -funroll-loops -fopt-info-vec -fopt-info-loop -ffast-math -o test.vec.fm.exe test.c
test.c:4:3: note: loop vectorized
test.c:4:30: note: loop unrolled 3 times
~> time ./test.vec.fm.exe
4499999999597346816.000000
real 0m1.980s
user 0m1.980s
sys 0m0.000s
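Floating-point addition is not associative, so by default GCC will not reorder the sum, and reordering is what vectorizing this reduction requires. Compile with -Ofast and it will vectorize (you will see vaddpd). -ffast-math (which you get with -Ofast) tells the compiler to assume floating-point addition is associative. This means your results with and without associative math may differ (though the reordered result is not necessarily less accurate). Note that ICC defaults to associative floating-point math, so ICC will vectorize this with only -O3, whereas Clang, GCC, and MSVC do not assume associative math by default.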