MSVC seems to be taking the values from my array of coefficients and scattering them around in its .rdata section, not keeping them contiguous even though they're all used together. And it takes the absolute value of the negative ones, using vsubss instead of adding a negative with vaddss. It also seems to waste vmovaps register-copy instructions.
Also MSVC isn't using FMA instructions, but there are some existing Q&As about that.
(Big picture, I'm seeing intermittent ~40x slowdowns for some unlucky routines, which switch on or off by changing the total length of the quoted strings in the program! I don't know if the code-gen I'm asking about in this question can explain all of that or not.)
Why does MSVC sometimes create a new table of absolute-value polynomial coefficients, in a weird order and interspersed with other data? This happens for shorter polynomials with constant coefficients, where the length and the coefficients are known at compile time. Code generation unrolls the loop in all cases for both Intel and MSVC. In this example I have reduced it to a very short MRE (with a sample on Godbolt, which I hope will behave):
const float p10_32[] = { 0.99996600f, 1.007032f, -0.74001284f, 3.3444971f, -21.49531f, 82.639426f, -194.32462f, 283.06584f, -249.3704f, 121.73285f, -25.28842f }; // slightly tweaked

float evalpoly32_horner(double xin, const double* p)
{
    float x = (float)xin;
    return p10_32[0] + x * (p10_32[1] + x * (p10_32[2] + x * (p10_32[3] + x * (p10_32[4] + x * (p10_32[5] + x * (p10_32[6] + x * (p10_32[7] + x * (p10_32[8] + x * (p10_32[9] + x * (p10_32[10]))))))))));
}
I was tormenting MSVC and ICX with a polynomial benchmark when I noticed something a bit odd about the code generated by MSVC 17.1 compared with ICX 2024.1 for short polynomials, in this case up to x^10, with the length fixed and known at compile time. The coefficient array and the Horner code are machine-generated equal-ripple approximations.
The strange behaviour seems to be very specific to this particular way of expressing the formula: MSVC doesn't clone the coefficients if the polynomial is written as a loop, although it does still fully unroll that loop into add-and-multiply-by-x code for each term (the loop shape I mean is sketched just below).
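For concreteness, the loop shape I'm referring to is roughly this (a sketch, not the exact benchmark code; it uses the same p10_32 array as above):

// Hypothetical loop form of the same Horner evaluation.  MSVC still fully
// unrolls this into per-term multiply/add code, but it does not clone the
// coefficients into a second table for this shape.
static float evalpoly32_loop(float x)
{
    float acc = p10_32[10];
    for (int i = 9; i >= 0; --i)
        acc = p10_32[i] + x * acc;   // Horner step: p[i] + x*acc
    return acc;
}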
Code generation is AVX2 at maximum optimisation but with function inlining disabled. This affects short to modest-length polynomials, and I am at a complete loss to explain why MSVC does this, or whether there is any advantage to the rather strange order in which it stores the extra set of absolute-value coefficients. It might make sense if there were a mismatch between float coefficients and a double variable x, but when both are floats why does it bother?
First, the simpler and faster Intel ICX 2024.1 code, which is a solid chain of FMAs acting on the original coefficient array.
Original float coefficients in the array p10_32, at the same base address (different on the Intel build, where the base address is 0x7FF7DB993570h):
+ p10_32 0x00007ff75a3434e0 {0.999966025, 1.00703204, -0.740012825, 3.34449720, -21.4953098, 82.6394272, -194.324615, ...} const float[11]
Intel AVX2 float
Uses the original array in standard order
--- C:\Users\Martin\source\repos\SO_ToyAG\SO_ToyAG.cpp -------------------------
return (p10_32[0] + x * (p10_32[1] + x * (p10_32[2] + x * (p10_32[3] + x * (p10_32[4] + x * (p10_32[5] + x * (p10_32[6] + x * (p10_32[7] + x * (p10_32[8] + x * (p10_32[9] + x * (p10_32[10])))))))))));
00007FF713E31110 C5 FA 10 0D 90 20 00 00 vmovss xmm1,dword ptr [__real@c1ca4eaf (07FF713E331A8h)]
00007FF713E31118 C4 E2 79 A9 0D 8B 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@42f37738 (07FF713E331ACh)]
00007FF713E31121 C4 E2 79 A9 0D 86 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@c3795ed3 (07FF713E331B0h)]
00007FF713E3112A C4 E2 79 A9 0D 81 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@438d886d (07FF713E331B4h)]
00007FF713E31133 C4 E2 79 A9 0D 7C 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@c342531a (07FF713E331B8h)]
00007FF713E3113C C4 E2 79 A9 0D 77 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@42a54763 (07FF713E331BCh)]
00007FF713E31145 C4 E2 79 A9 0D 72 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@c1abf665 (07FF713E331C0h)]
00007FF713E3114E C4 E2 79 A9 0D 6D 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@40560c3e (07FF713E331C4h)]
00007FF713E31157 C4 E2 79 A9 0D 68 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@bf3d717b (07FF713E331C8h)]
00007FF713E31160 C4 E2 79 A9 0D 63 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@3f80e66d (07FF713E331CCh)]
00007FF713E31169 C4 E2 71 A9 05 5E 20 00 00 vfmadd213ss xmm0,xmm1,dword ptr [__real@3f7ffdc6 (07FF713E331D0h)]
00007FF713E31172 C3 ret
Note that Intel accesses the original coefficients array in strict sequential order.
Now for the MSVC code, which is rather odd.
Note that a new set of absolute-value coefficients in float format has been created at base address approximately 0x7ff75a34356c. They are not in the original order, or even contiguous in memory! There is assorted unrelated data in between them, nothing obviously recognisable. Below is the mapping from each coefficient's original array location to its working-copy location (low address bits only).
MSVC AVX2 float
return (p10_32[0] + x * (p10_32[1] + x * (p10_32[2] + x * (p10_32[3] + x * (p10_32[4] + x * (p10_32[5] + x * (p10_32[6] + x * (p10_32[7] + x * (p10_32[8] + x * (p10_32[9] + x * (p10_32[10])))))))))));
coefficient         p0   p1   p2   p3   p4   p5   p6   p7   p8   p9   p10
primary copy        4e0  4e8  4f0  4f8  500  508  510  518  520  528  530
float working copy  570  578  56c  5b8  5e4  5f0  5f8  600  5fc  5f4  5e8
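As an aside, here is a standalone snippet (mine, not part of the benchmark) for matching the __real@xxxxxxxx labels in the two disassembly listings against the coefficients and their absolute values:

#include <cstdio>
#include <cstring>
#include <cmath>

// Print the IEEE-754 bit pattern of each coefficient and of its absolute
// value, to compare with the __real@xxxxxxxx labels.  For example
// -25.28842f is c1ca4eaf (Intel listing) and 25.28842f is 41ca4eaf
// (MSVC's working copy).
const float p10_32[] = { 0.99996600f, 1.007032f, -0.74001284f, 3.3444971f,
                         -21.49531f, 82.639426f, -194.32462f, 283.06584f,
                         -249.3704f, 121.73285f, -25.28842f };

static unsigned bits(float f)
{
    unsigned u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main()
{
    for (float c : p10_32)
        std::printf("%12.8g  __real@%08x  abs: __real@%08x\n",
                    c, bits(c), bits(std::fabs(c)));
}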
--- C:\Users\Martin\source\repos\SO_ToyAG\SO_ToyAG.cpp -------------------------
float x = (float)xin;
return (p10_32[0] + x * (p10_32[1] + x * (p10_32[2] + x * (p10_32[3] + x * (p10_32[4] + x * (p10_32[5] + x * (p10_32[6] + x * (p10_32[7] + x * (p10_32[8] + x * (p10_32[9] + x * (p10_32[10])))))))))));
00007FF7DB991180 C5 FA 59 15 60 24 00 00 vmulss xmm2,xmm0,dword ptr [__real@41ca4eaf (07FF7DB9935E8h)] xmm2 = x*|p10|
00007FF7DB991188 C5 FA 10 0D 64 24 00 00 vmovss xmm1,dword ptr [__real@42f37738 (07FF7DB9935F4h)] xmm1 = p9
00007FF7DB991190 C5 F2 5C D2 vsubss xmm2,xmm1,xmm2 xmm2 = p9 - x*|p10| = p9 + x*p10
00007FF7DB991194 C5 EA 59 D8 vmulss xmm3,xmm2,xmm0 xmm2 = p9+x*p10, xmm3 = x*(p9+x*p10)
00007FF7DB991198 C5 E2 5C 25 5C 24 00 00 vsubss xmm4,xmm3,dword ptr [__real@43795ed3 (07FF7DB9935FCh)] xmm4 = xmm3 - p8
00007FF7DB9911A0 C5 DA 59 C8 vmulss xmm1,xmm4,xmm0 xmm1 = x*(xmm3-p8)
00007FF7DB9911A4 C5 F2 58 15 54 24 00 00 vaddss xmm2,xmm1,dword ptr [__real@438d886d (07FF7DB993600h)] xmm2 = xmm1 + p7
00007FF7DB9911AC C5 EA 59 D8 vmulss xmm3,xmm2,xmm0 xmm3 = x*xmm2
00007FF7DB9911B0 C5 E2 5C 25 40 24 00 00 vsubss xmm4,xmm3,dword ptr [__real@4342531a (07FF7DB9935F8h)]
00007FF7DB9911B8 C5 DA 59 C8 vmulss xmm1,xmm4,xmm0 xmm1 = x*(xmm3 - p6)
00007FF7DB9911BC C5 F2 58 15 2C 24 00 00 vaddss xmm2,xmm1,dword ptr [__real@42a54763 (07FF7DB9935F0h)]
00007FF7DB9911C4 C5 EA 59 D8 vmulss xmm3,xmm2,xmm0 xmm3= x*(xmm1 + p5)
00007FF7DB9911C8 C5 E2 5C 25 14 24 00 00 vsubss xmm4,xmm3,dword ptr [__real@41abf665 (07FF7DB9935E4h)] xmm4 = xmm3 - |p4|
00007FF7DB9911D0 C5 F8 28 E8 vmovaps xmm5,xmm0 xmm5 = xmm0 why????
00007FF7DB9911D4 C5 DA 59 C0 vmulss xmm0,xmm4,xmm0 xmm0 = x*xmm4
00007FF7DB9911D8 C5 FA 58 0D D8 23 00 00 vaddss xmm1,xmm0,dword ptr [__real@40560c3e (07FF7DB9935B8h)]
00007FF7DB9911E0 C5 F2 59 D5 vmulss xmm2,xmm1,xmm5 xmm2 = x*(xmm0 + p3)
00007FF7DB9911E4 C5 EA 5C 1D 80 23 00 00 vsubss xmm3,xmm2,dword ptr [__real@3f3d717b (07FF7DB99356Ch)]
00007FF7DB9911EC C5 E2 59 C5 vmulss xmm0,xmm3,xmm5 xmm0 = x*(xmm2 - p2)
00007FF7DB9911F0 C5 FA 58 0D 80 23 00 00 vaddss xmm1,xmm0,dword ptr [__real@3f80e66d (07FF7DB993578h)]
00007FF7DB9911F8 C5 F2 59 D5 vmulss xmm2,xmm1,xmm5 xmm2 = x*(xmm0 + p1)
00007FF7DB9911FC C5 EA 58 05 6C 23 00 00 vaddss xmm0,xmm2,dword ptr [__real@3f7ffdc6 (07FF7DB993570h)] xmm0 = xmm2 + p0
}
00007FF7DB991204 C3 ret
The snapshot is from the debugger and the comments are mine. I'm really puzzled by why it has cloned the provided coefficients as their absolute values and then alternates between vaddss and vsubss during evaluation; Intel's code generation is significantly faster. I'm also curious about its use of so many registers. I thought that hardware register renaming meant it could just hammer a single register pair without any consequences (as the faster Intel code does).
I can see no logic at all in the locations where it stores the various coefficients. I thought at first it must be some cunning cache trick, but if so I can't figure it out.
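To spell out what I think the MSVC sequence computes at the source level (my reading of the disassembly above, with the coefficient values written back in as literals):

// Source-level equivalent, as far as I can tell, of what the MSVC output
// computes: every negative coefficient is stored as its absolute value and
// the corresponding add becomes a subtract.  Mathematically it is the same
// Horner recurrence; only the sign handling moves into the choice of
// vaddss vs vsubss.
static float horner_as_msvc_emits_it(float x)
{
    float t = 121.73285f - x * 25.28842f;   // p9 + x*p10  (p10 = -25.28842)
    t = x * t - 249.3704f;                  // + p8        (p8  = -249.3704)
    t = x * t + 283.06584f;                 // + p7
    t = x * t - 194.32462f;                 // + p6        (p6  = -194.32462)
    t = x * t + 82.639426f;                 // + p5
    t = x * t - 21.49531f;                  // + p4        (p4  = -21.49531)
    t = x * t + 3.3444971f;                 // + p3
    t = x * t - 0.74001284f;                // + p2        (p2  = -0.74001284)
    t = x * t + 1.007032f;                  // + p1
    t = x * t + 0.99996600f;                // + p0
    return t;
}

In other words it is the same Horner recurrence; each negative coefficient has been stored as its absolute value and the matching vaddss replaced by a vsubss.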
I put the polyeval snippet on Godbolt, and there it generates more sensible code using only xmm0 and xmm1, compared with my locally installed copy carrying the same 17.1 version tag. But the creation of the absolute-value coefficients remains, and it doesn't use FMA even with /O2 /arch:AVX2. Another oddity: when I built it up on Godbolt constants first and then the routine, I got tidier code. Yet on double-checking the link I added here, I now see it cycling through xmm0 to xmm4 with the same code and the same compiler, and no changes to the source. You can't tell from Godbolt, but the addresses of the absolute-valued coefficients are in a random order and not contiguous in memory.
I am completely mystified by why, part way through, it copies xmm0 (the input value x) into xmm5 in my debugger disassembly (to be fair, the MSVC compiler on Godbolt doesn't do this). It needs to return the result in xmm0, but it could arrange that easily enough without the extra copy.
Why is it being so profligate with register usage, and does that have any advantage? Looking at Godbolt, that aspect seemed to be a little local difficulty with my compiler doing slightly odd code generation, and I thought it might be time to upgrade it again. Then I went back to check the link before posting and it misbehaved almost the same as here, using many xmm registers. The Intel code is as frugal as you can get, just xmm0 and xmm1 (and faster)!
The MS compiler seems reluctant to generate FMA code. Am I missing something obvious here?
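For reference, one way to ask for FMA explicitly is with the FMA3 intrinsics. This is a sketch of mine, not the benchmark code; it needs FMA3 hardware (/arch:AVX2 here) and assumes the p10_32 array from above is in scope:

#include <immintrin.h>

// Explicitly-FMA'd Horner evaluation.  _mm_fmadd_ss(a, b, c) computes
// a*b + c in the low lane, i.e. one vfmadd...ss per step.
static float evalpoly32_horner_fma(float x)
{
    __m128 vx  = _mm_set_ss(x);
    __m128 acc = _mm_set_ss(p10_32[10]);
    for (int i = 9; i >= 0; --i)
        acc = _mm_fmadd_ss(acc, vx, _mm_set_ss(p10_32[i]));  // acc = acc*x + p[i]
    return _mm_cvtss_f32(acc);
}

The comments below point at FP contraction (#pragma fp_contract and the related compiler options) as the more general route to getting vfmadd generated from the plain a + x*b source.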
From the comments:

"/O2 /arch:AVX2 may not be enough. As you noted, it's not using the vfmadd213ss instruction. For example, on my older gcc [8.3.1], with just -O3 -march=corei7-avx, it generates vmulss/vaddss/vsubss sequences. To get vfmadd213ss, I had to use -O3 -march=skylake-avx512. Without -march=, compilers will [usually] try to probe the local machine (e.g. cpuid) and build for the max (if -march=native). So, icx may just have better defaults."

(My reply:) "I can get an fma version by coding it explicitly, but it makes a right dog's dinner of that. Alas there is no -O3 on MS. Another oddity is that /arch:SSE2 evokes the message Warning D9002 ignoring unknown option '/arch:SSE2' but then generates SSE2 code!"

"Without -march=, GCC and Clang default to a baseline, just x86-64, so binaries can run anywhere; the opposite if -march=native. ICX defaults to some level of -ffast-math (at least ICC-classic did), but even it doesn't use CPU features like AVX or FMA without -march options, since making binaries that fail on other people's computers is a worse downside than the upside of looking good on benchmarks for naive users who don't realize its aggressive defaults. godbolt.org/z/r4W9x46ss shows ICX using mulss and addss (not vmulss/vaddss) without any -march options."

"Re: contracting a*b + c into an FMA, see Difference in gcc -ffp-contract options, and #pragma STDC FP_CONTRACT (which LLVM respects; GCC only implements off and fast, not "on", i.e. within single expressions but not across statements). And see learn.microsoft.com/en-us/cpp/preprocessor/… for MSVC's #pragma fp_contract (on). Apparently MSVC didn't used to respect that, and only contracted with fast-math: Automatically generate FMA instructions in MSVC."