
MSVC seems to be taking the values from my array of coefficients and scattering them around in its .rdata section, not keeping them contiguous even though they're all used together. And it takes the absolute value of the negative ones, using vsubss instead of adding a negative with vaddss. It also seems to waste vmovaps register-copy instructions.

Also MSVC isn't using FMA instructions, but there are some existing Q&As about that.

(Big picture, I'm seeing intermittent ~40x slowdowns for some unlucky routines, which switch on or off by changing the total length of the quoted strings in the program! I don't know if the code-gen I'm asking about in this question can explain all of that or not.)


Why does MSVC sometimes create a new table of absolute-value polynomial coefficients, in a weird order and interspersed with other data? This happens for shorter polynomials with constant coefficients where the length and the coefficients are known at compile time. Both Intel and MSVC fully unroll a loop form of the evaluation in all cases. In this example I have it as a very short MRE (and a sample on Godbolt, which I hope will behave):

const float p10_32[] = { 0.99996600f, 1.007032f, -0.74001284f, 3.3444971f, -21.49531f, 82.639426f, -194.32462f, 283.06584f, -249.3704f, 121.73285f, -25.28842f };  // slightly tweaked

float evalpoly32_horner(double xin, const double* p)
{
    float x = (float)xin;
    return p10_32[0] + x * (p10_32[1] + x * (p10_32[2] + x * (p10_32[3] + x * (p10_32[4] + x * (p10_32[5] + x * (p10_32[6] + x * (p10_32[7] + x * (p10_32[8] + x * (p10_32[9] + x * (p10_32[10]))))))))));
}

I was tormenting MSVC and ICX with a polynomial benchmark when I noticed something a bit odd about the code generated by MSVC 17.1 compared with ICX 2024.1 for short polynomials, in this case up to x^10, with the fixed length known at compile time. The coefficient arrays and Horner code are machine-generated equal-ripple approximations.

The strange behaviour seems to be very specific to this particular way of expressing the formula. MSVC doesn't clone the coefficients if the polynomial is written as a loop, although it does fully unroll that loop into an add-then-multiply-by-x sequence for each term (the loop form I mean is sketched below).
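For reference, the loop form is something like this, a minimal sketch rather than the machine-generated benchmark source (evalpoly32_loop is my name for it):

float evalpoly32_loop(double xin)
{
    float x = (float)xin;
    float acc = p10_32[10];               // highest-order coefficient
    for (int i = 9; i >= 0; --i)          // constant trip count, so MSVC fully unrolls it
        acc = p10_32[i] + x * acc;        // Horner step
    return acc;
}

Written this way, the unrolled code keeps reading the original p10_32 array rather than a scattered absolute-value copy.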

Code generation is AVX2 at maximum optimisation but with function inlining disabled. This affects short to modest length polynomials, and I am at a complete loss to explain why MSVC does this, or whether there is any advantage to the rather strange order in which it stores the extra set of coefficients promoted to absolute values. Making a converted copy would make sense if there were a mismatch between float coefficients and a double variable x, but when both are floats why does it bother?

First, the simpler and faster Intel ICX 2024.1 code, which is a solid run of FMAs acting on the original coefficient array.

Original float coefficients in the array p10_32 with the same base address (different in the Intel build, where the base address is 0x07FF7DB993570h):

+       p10_32  0x00007ff75a3434e0 {0.999966025, 1.00703204, -0.740012825, 3.34449720, -21.4953098, 82.6394272, -194.324615, ...}   const float[11]

Intel AVX2 float
Uses the original array in standard order

--- C:\Users\Martin\source\repos\SO_ToyAG\SO_ToyAG.cpp -------------------------
    return (p10_32[0] + x * (p10_32[1] + x * (p10_32[2] + x * (p10_32[3] + x * (p10_32[4] + x * (p10_32[5] + x * (p10_32[6] + x * (p10_32[7] + x * (p10_32[8] + x * (p10_32[9] + x * (p10_32[10])))))))))));
00007FF713E31110 C5 FA 10 0D 90 20 00 00 vmovss      xmm1,dword ptr [__real@c1ca4eaf (07FF713E331A8h)]  
00007FF713E31118 C4 E2 79 A9 0D 8B 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@42f37738 (07FF713E331ACh)]  
00007FF713E31121 C4 E2 79 A9 0D 86 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@c3795ed3 (07FF713E331B0h)]  
00007FF713E3112A C4 E2 79 A9 0D 81 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@438d886d (07FF713E331B4h)]  
00007FF713E31133 C4 E2 79 A9 0D 7C 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@c342531a (07FF713E331B8h)]  
00007FF713E3113C C4 E2 79 A9 0D 77 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@42a54763 (07FF713E331BCh)]  
00007FF713E31145 C4 E2 79 A9 0D 72 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@c1abf665 (07FF713E331C0h)]  
00007FF713E3114E C4 E2 79 A9 0D 6D 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@40560c3e (07FF713E331C4h)]  
00007FF713E31157 C4 E2 79 A9 0D 68 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@bf3d717b (07FF713E331C8h)]  
00007FF713E31160 C4 E2 79 A9 0D 63 20 00 00 vfmadd213ss xmm1,xmm0,dword ptr [__real@3f80e66d (07FF713E331CCh)]  
00007FF713E31169 C4 E2 71 A9 05 5E 20 00 00 vfmadd213ss xmm0,xmm1,dword ptr [__real@3f7ffdc6 (07FF713E331D0h)]  
00007FF713E31172 C3                   ret  

Note that Intel accesses the original coefficients array in strict sequential order.

Now for the MS code which is rather odd.

Note that a new set of absolute-value coefficients in float format has been created, with base address approximately 0x07ff75a34356c. They are not in the original order, and they aren't even contiguous in memory! Various junk sits between them, nothing obviously recognisable. Below is the mapping from each coefficient's original array location to its working-copy location (low three hex digits of each address).

MSVC AVX2 float 

    term                 p0    p1    p2    p3    p4    p5    p6    p7    p8    p9    p10
    primary copy         4e0   4e8   4f0   4f8   500   508   510   518   520   528   530
    float working copy   570   578   56c   5b8   5e4   5f0   5f8   600   5fc   5f4   5e8


--- C:\Users\Martin\source\repos\SO_ToyAG\SO_ToyAG.cpp -------------------------
    float x = (float)xin;
    return (p10_32[0] + x * (p10_32[1] + x * (p10_32[2] + x * (p10_32[3] + x * (p10_32[4] + x * (p10_32[5] + x * (p10_32[6] + x * (p10_32[7] + x * (p10_32[8] + x * (p10_32[9] + x * (p10_32[10])))))))))));
00007FF7DB991180 C5 FA 59 15 60 24 00 00 vmulss      xmm2,xmm0,dword ptr [__real@41ca4eaf (07FF7DB9935E8h)]  
00007FF7DB991188 C5 FA 10 0D 64 24 00 00 vmovss      xmm1,dword ptr [__real@42f37738 (07FF7DB9935F4h)]  
00007FF7DB991190 C5 F2 5C D2          vsubss      xmm2,xmm1,xmm2  
00007FF7DB991194 C5 EA 59 D8          vmulss      xmm3,xmm2,xmm0                        xmm2 = p9+x*p10, xmm3 = x*(p9+x*p10)
00007FF7DB991198 C5 E2 5C 25 5C 24 00 00 vsubss      xmm4,xmm3,dword ptr [__real@43795ed3 (07FF7DB9935FCh)]     xmm4 = xmm3 - p8
00007FF7DB9911A0 C5 DA 59 C8          vmulss      xmm1,xmm4,xmm0                                                xmm1 = x*(xmm3-p8)
00007FF7DB9911A4 C5 F2 58 15 54 24 00 00 vaddss      xmm2,xmm1,dword ptr [__real@438d886d (07FF7DB993600h)]     xmm2 = xmm1 + p7
00007FF7DB9911AC C5 EA 59 D8          vmulss      xmm3,xmm2,xmm0                                                xmm3 = x*xmm2
00007FF7DB9911B0 C5 E2 5C 25 40 24 00 00 vsubss      xmm4,xmm3,dword ptr [__real@4342531a (07FF7DB9935F8h)]  
00007FF7DB9911B8 C5 DA 59 C8          vmulss      xmm1,xmm4,xmm0                                                xmm1 = x*(xmm3 - p6)
00007FF7DB9911BC C5 F2 58 15 2C 24 00 00 vaddss      xmm2,xmm1,dword ptr [__real@42a54763 (07FF7DB9935F0h)]  
00007FF7DB9911C4 C5 EA 59 D8          vmulss      xmm3,xmm2,xmm0                                                xmm3= x*(xmm1 + p5)

00007FF7DB9911C8 C5 E2 5C 25 14 24 00 00 vsubss      xmm4,xmm3,dword ptr [__real@41abf665 (07FF7DB9935E4h)]     xmm4 = xmm3 - p4
00007FF7DB9911D0 C5 F8 28 E8          vmovaps     xmm5,xmm0                                                     xmm5 = xmm0         why????
00007FF7DB9911D4 C5 DA 59 C0          vmulss      xmm0,xmm4,xmm0                                                xmm0 = x*xmm4
00007FF7DB9911D8 C5 FA 58 0D D8 23 00 00 vaddss      xmm1,xmm0,dword ptr [__real@40560c3e (07FF7DB9935B8h)]  
00007FF7DB9911E0 C5 F2 59 D5          vmulss      xmm2,xmm1,xmm5                                                xmm2 = x*(xmm0 + p3)
00007FF7DB9911E4 C5 EA 5C 1D 80 23 00 00 vsubss      xmm3,xmm2,dword ptr [__real@3f3d717b (07FF7DB99356Ch)]  
00007FF7DB9911EC C5 E2 59 C5          vmulss      xmm0,xmm3,xmm5                                                xmm0 = x*(xmm2 - p2)
00007FF7DB9911F0 C5 FA 58 0D 80 23 00 00 vaddss      xmm1,xmm0,dword ptr [__real@3f80e66d (07FF7DB993578h)]  
00007FF7DB9911F8 C5 F2 59 D5          vmulss      xmm2,xmm1,xmm5                                                xmm2 = x*(xmm0 + p1)
00007FF7DB9911FC C5 EA 58 05 6C 23 00 00 vaddss      xmm0,xmm2,dword ptr [__real@3f7ffdc6 (07FF7DB993570h)]     xmm0 = xmm2 + p0
}
00007FF7DB991204 C3                   ret  

The snapshot is from the debugger and the comments are mine. I'm really puzzled about why it has cloned the provided coefficients as their absolute values and then alternates between vaddss and vsubss during evaluation. Intel's code generation is significantly faster. I'm also curious about its use of so many registers: I thought hardware register renaming meant it could just hammer a single register pair without any consequences (like the faster Intel code does).
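Spelling out the first few steps, my reading of the disassembly above is equivalent to the following (a reconstruction only, identifiers are mine; the stored constants are the absolute values of the original negative coefficients):

    float acc = 121.73285f - x * 25.28842f;   // p9 - x*|p10|   (vmulss + vsubss)
    acc = x * acc - 249.3704f;                // x*acc - |p8|   (vsubss)
    acc = x * acc + 283.06584f;               // x*acc + p7     (vaddss)
    acc = x * acc - 194.32462f;               // x*acc - |p6|   (vsubss), and so on down to p0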

I can see no logic at all to the locations where it stores the various coefficients. I thought at first it must be some cunning cache trick, but if it is then I can't figure it out.

I put the polyeval snippet on Godbolt, and there it generates more sane code using only xmm0 and xmm1, compared to my locally installed copy with the same 17.1 version tag. But the creation of the absolute-value coefficients remains, and it doesn't use FMA even with /O2 /arch:AVX2. Another oddity: when I built it up on Godbolt constants first and then the routine, I got tidier code. Now, double-checking the link I added here, I see it hitting xmm0 through xmm4 cyclically, same code, same compiler, no changes to the source. You can't tell from Godbolt, but the addresses of the absolute-valued coefficients are in a random order and not contiguous in memory.

I am completely mystified by why, in my debugger disassembly, it clobbers xmm5 with a copy of xmm0 (the input value x) part way through (to be fair, the MS compiler on Godbolt doesn't do this). It needs to return the result in xmm0, but it could arrange that easily enough.

Why is it being so profligate with register usage, and does that have any advantage? Looking at Godbolt, that aspect initially seemed to be a little local difficulty with my compiler doing slightly odd code generation, and I thought it was maybe time to upgrade it again. But then I went back to check the link before posting and it misbehaved almost the same as here, using many xmm registers. The Intel code is as frugal as you can get, just xmm0 and xmm1 (and faster)!

The MS compiler also seems reluctant to generate FMA code; I can force it by writing the FMA explicitly (something like the sketch below), but am I missing something obvious here?
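For reference, what I mean by forcing FMA explicitly is along these lines: a minimal sketch using intrinsics (the function name evalpoly32_fma is mine, not part of the benchmark):

#include <immintrin.h>

float evalpoly32_fma(double xin)
{
    float x = (float)xin;
    __m128 vx  = _mm_set_ss(x);
    __m128 acc = _mm_set_ss(p10_32[10]);
    for (int i = 9; i >= 0; --i)
        acc = _mm_fmadd_ss(acc, vx, _mm_set_ss(p10_32[i]));   // acc = acc*x + p[i]
    return _mm_cvtss_f32(acc);
}

_mm_fmadd_ss should map to a vfmadd*ss instruction under /arch:AVX2, but writing every step by hand like this is hardly tidy.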

  • Just a guess ... Giving the MS compiler just /O2 /arch:AVX2 may not be enough. As you noted, it's not using the vfmadd213ss instruction. For example, on my older gcc [8.3.1], with just -O3 -march=corei7-avx, it generates vmulss/vaddss/vsubss sequences. To get vfmadd213ss, I had to use -O3 -march=skylake-avx512. Without -march=, compilers will [usually] try to probe the local machine (e.g. cpuid) and build for the max (if -march=native). So, icx may just have better defaults. Commented Jul 11 at 23:49
  • @CraigEstey I'm mainly interested in why it makes the coefficients all positive and scatters them apparently randomly in the static data space. I suspect it is related to a weird slowdown problem I sometimes see only when using MSVC. I can force an FMA version by coding it explicitly but it makes a right dog's dinner of that. Alas there is no -O3 on MS. Another oddity is that /arch:SSE2 provokes the message Warning D9002 ignoring unknown option '/arch:SSE2' but then generates SSE2 code! Commented Jul 12 at 8:02
  • @CraigEstey: Most compilers default to -march= baseline, just x86-64, so binaries can run anywhere. The opposite if -march=native. ICX defaults to some level of -ffast-math (at least ICC-classic did), but even it doesn't use CPU features like AVX or FMA without -march options, since making binaries that fail on other people's computers is a worse downside than the upside of looking good on benchmarks for naive users who don't realize its aggressive defaults. godbolt.org/z/r4W9x46ss shows ICX using mulss and addss (not vmulss / vaddss) without any -march options. Commented Jul 12 at 13:38
  • @Martin: Re: contracting a*b + c into an FMA, see Difference in gcc -ffp-contract options re: that and #pragma STDC FP_CONTRACT (which LLVM respects; GCC only implements off and fast, not "on" (within single expressions not across statements)). And learn.microsoft.com/en-us/cpp/preprocessor/… for MSVC's #pragma fp_contract (on). Apparently MSVC didn't used to respect that, and only contracted with fast-math: Automatically generate FMA instructions in MSVC Commented Jul 12 at 13:43
  • @PeterCordes Thanks for tidying it up. That's a much better summary of my problem. It is really more of a curiosity than anything else. I have several related polynomial evaluation questions arising out of this series of experiments, but all of them need a lot more work to make a good MRE. This one was the easiest. Commented Jul 12 at 14:40
