Why do GCC and Clang fail to auto-vectorize simple loop?

Question

I have two functions counting the occurrences of a target char in the given input buffer. The functions vary only in how they communicate the result back to the caller; one returns the result and the other writes to a variable passed by reference.

#include <cstdlib>

#define BUF_LEN 0x1000

size_t check_count1(const char* buf, char target) {
    size_t count = 0;
    for (size_t i = 0; i < BUF_LEN; i++) {
        if (buf[i] == target) {
            count++;
        }
    }
    return count;
}

void check_count2(const char* buf, char target, size_t& count) {
    for (size_t i = 0; i < BUF_LEN; i++) {
        if (buf[i] == target) {
            count++;
        }
    }
}

I am puzzled by how Clang and GCC generate code for these two functions. The loop in check_count1 is vectorized, but for check_count2 it's not. Initially I thought this was due to pointer aliasing in the second case, but specifying __restrict has no effect. Here's the link to compiler explorer.

An older ICC compiler did just fine with both loops. What changed?

pointer aliasing would be my guess. Just one more reason to not use output parameters — NathanOliver
– NathanOliver, Commented Nov 12 at 16:36
@Jarod42 Here's an example with __restrict on both pointers: godbolt.org/z/oc94bxq71 — Devashish
– Devashish, Commented Nov 12 at 16:38
When if is dropped to have branchless code then it is also vectorized. — Marek R
– Marek R, Commented Nov 12 at 16:50

cpplearner · Accepted Answer · 2025-11-12 18:53:55Z

One reason is pointer aliasing, as NathanOliver points out in comment.

Another reason is that, in its current form, if buf[i] == target is false for every i, the modification of count needs to be skipped. This doesn't matter if count is a local variable, in which case an extra assignment to count is unobservable, but matters if count lives in rodata†, where modification is not allowed.

If the loop body is changed to always modify count, then GCC and Clang will vectorize. For example, this will also be vectorized:

void check_count2(const char* buf, char target, size_t& __restrict count) {
    for (size_t i = 0; i < BUF_LEN; i++) {
        if (buf[i] == target) {
            count++;
        } else {
            count++;
            count--;
        }
    }
}

As discussed in Crash with icc: can the compiler invent writes where none existed in the abstract machine?, ICC invents modifications even if it's not allowed to do so.

_{†: This can happen if count is initialized from const_cast<size_t&>(n), where n is a const global variable. Note that casting away constness is not undefined behavior: modifying const variable is.}

Denys Rtveliashvili · Accepted Answer · 2025-11-12 18:12:16Z

In the first case there is more flexibility for compiler to deal with the problem.

"count" is declared in the function and can be returned when the job is done. This means the compiler can keep it wherever it wishes, and it will likely keep it in a CPU register.

In the second case you pass a reference to the "count". That means a few of problems which make vectorization hard:

"count" is kept somewhere in RAM and the function got its address. Vectorized instructions would struggle with this anyway.
The compiler does not know your intentions so it has to play it safe. Do you intend the rest of the code to look at "count" after the job is done or do you want the changes to it to be potentially observable from the outside? The standard does not cover it clearly, so you are in the territory of undefined behaviour anyway. The old ICC compiler was probably making assumptions different to those that Clang and GCC make.
What if the array of "buf" and the location of "count" overlap? Technically, it is not impossible.

Collectives™ on Stack Overflow

Why do GCC and Clang fail to auto-vectorize simple loop?

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related