
I'm currently optimizing a kernel, and one loop still accounts for 80% of the benchmark's execution time despite my optimization efforts. The loop is memory-bound, primarily because of an indirect memory access pattern that defeats hardware prefetching. I'm looking for a way to hint to the compiler to emit indirect prefetches without losing its other optimizations.

Here's the relevant snippet of the source code:

    for (j = 0; j < lastrow - firstrow + 1; j++) {
        sum = 0.0;
        for (k = rowstr[j]; k < rowstr[j+1]; k++) {
            sum = sum + a[k] * p[colidx[k]];
        }
        q[j] = sum;
    }

The bottleneck is the indirect memory access, in particular the load of the p array at position colidx[k].

I compiled with the Intel compiler using the flags -O3 -xhost -mcmodel=medium -qopt-prefetch=5 -xCOMMON-AVX512. It seems the compiler unrolled the loop by a factor of 8, vectorized it, and inserted some prefetch instructions thanks to the high -qopt-prefetch level:

Block 1:
vmovupd     (%r11,%r10,4), %ymm15
kxnorw      %k0, %k0, %k1
vpxord      %zmm16, %zmm16, %zmm16
vgatherdpd  (%r9,%ymm15,8), %zmm16{%k1}
vfmadd231pd (%rbx,%r10,8), %zmm16, %zmm14
prefetcht0  0x900(%rbx,%r10,8)
prefetcht0  0x480(%r11,%r10,4)
add $0x8, %r10
cmp %rsi, %r10
jbe 0x403370 <Block 1>

I've experimented with compiler hints and even tried the volatile keyword, but it seems I'm forced to choose between vectorization and prefetching p[colidx[k+PREFETCH_FACTOR]]. I believe it should be possible to get both for optimal performance.

Would appreciate any insights or suggestions on how to achieve this balance. Thank you!

4
  • If the indices in colidx[k] are large and unpredictable and rowstr[j+1] is big, then the data will likely sit in RAM and I expect vectorization to have nearly no impact (even the choice of instructions for the whole loop, actually). Prefetching is indeed the key to improving performance, though the prefetching window is certainly not optimal. In fact, the best window often depends on where the data is stored (e.g. L1, L2, L3, RAM). If it is in the L1, then instruction-level optimizations matter and you can still optimize the kernel; if it is in DRAM, then only the prefetch window matters. Commented Mar 31, 2024 at 12:01
  • The thing is, no runtime profiling information is provided. You should profile the kernel to know how to optimize it. This is critical. For example, if the prefetching window is already fine, then there is not much more you can do except reordering items, if you iterate multiple times over the same array. Commented Mar 31, 2024 at 12:04
  • Also note that gather instructions are currently not implemented very efficiently on mainstream CPUs: they load each item one by one, so the gather is not really vectorized internally. It can still be a bit better than a scalar loop for arrays fitting in the L1 cache because it reduces the number of load instructions. But in this specific case, the prefetch instructions might actually be useless. Commented Mar 31, 2024 at 12:08
  • Does this article help? The distance parameter is the key here (the prefetching window). Commented Mar 31, 2024 at 12:11
