I'm currently working on optimizing a kernel, and one of the most time-consuming loops, despite optimization efforts, still accounts for 80% of the benchmark's execution time. The loop's performance bottleneck stems from its memory-bound nature, primarily due to an indirect memory access pattern that hampers prefetching. I'm seeking guidance on how to hint to the compiler to perform indirect prefetching without compromising other optimizations.
Here's the relevant snippet of the source code:
for (j = 0; j < lastrow - firstrow + 1; j++) {
sum = 0.0;
for (k = rowstr[j]; k < rowstr[j+1]; k++) {
sum = sum + a[k] * p[colidx[k]];
}
q[j] = sum;
}
The bottleneck arises from the indirect memory access pattern, particularly in accessing the p array at the colidx[k] position.
After compiling with the Intel compiler using the following flags:
-O3 -xhost -mcmodel=medium -qopt-prefetch=5 -xCOMMON-AVX512
It seems the compiler unrolled the loop by 8-factor, vectorized and added some prefetch instructions due to high qopt-prefetch level.
Block 1:
vmovupdy (%r11,%r10,4), %ymm15
kxnorw %k0, %k0, %k1
vpxord %zmm16, %zmm16, %zmm16
vgatherdpdq (%r9,%ymm15,8), %k1, %zmm16
vfmadd231pdz (%rbx,%r10,8), %zmm16, %zmm14
prefetcht0z 0x900(%rbx,%r10,8)
prefetcht0z 0x480(%r11,%r10,4)
add $0x8, %r10
cmp %rsi, %r10
jbe 0x403370 <Block 1>
I've experimented with hints and tried using the volatile keyword, but it seems like I'm forced to choose between vectorization and prefetching p[colidx[k+PREFETCH_FACTOR]]. However, I believe it's possible to leverage both for optimal performance.
Would appreciate any insights or suggestions on how to achieve this balance. Thank you!
colidx[k]are large and unpredictable androwstr[j+1]is big, then data will likely be stored in the RAM and I expect vectorization to have nearly no impact (even the choice of the instructions of the whole loop actually). Prefetching is indeed the key to improve performance, though the prefetching window is certainly not optimal. In fact, the best window is often dependent of where data is stored (e.g. L1, L2, L3, RAM). If it is stored in the L1, then optimizations matters and you can still optimize the kernel, if it is in DRAM, then only the prefetch window matters.distanceparameter is the key here (the prefetching window).