I want to compare two arrays and write the result to a bitmap. Therefore, I tried the following code:
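    #include <vector>

    // dst must already hold at least len elements.
    void wtf(int *lhs, int *rhs, int len, std::vector<bool>& dst){
        for (int i = 0; i < len; i++){
            dst[i] = lhs[i] == rhs[i];  // each write is a read-modify-write of one bit
        }
    }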
In Compiler Explorer, we can see that the above code is not vectorized. If we write the result to a boolean array instead, LLVM will vectorize it.
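For reference, the boolean-array variant mentioned here might look like the following. This is only a sketch: the name compare_to_bytes and the const/size_t signature are assumptions, not the exact code that was tested.

```cpp
#include <cstddef>

// Hypothetical variant of wtf(): stores one bool (one byte) per element
// instead of one bit, so every store is a plain byte write.
// dst must point to at least len bools.
void compare_to_bytes(const int* lhs, const int* rhs, std::size_t len, bool* dst) {
    for (std::size_t i = 0; i < len; ++i) {
        dst[i] = lhs[i] == rhs[i];
    }
}
```

The trade-off is that this output uses 8 times the memory of a bitmap.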
Why can the above code not be vectorized? What is the most efficient code to accomplish this task? I searched a lot and did not find the answer.
std::vector<bool> is a weird thing, and many people consider it to be a mistake. It's allowed to store bools as a single bit each (instead of using a full byte), which makes it very clunky. No wonder the compiler misses an optimisation here. vector<unsigned char> would vectorize fine, too. See also en.cppreference.com/w/cpp/container/vector_bool
vector<bool> is a bit-array, right? The C++ functions involved are a lot of code for a compiler to "see through". The actual asm needed on x86-64 is pcmpeqd / movmskps to get 4 bits at a time, and some scalar shift / OR to combine into 8-bit chunks. (AVX2 with 32-byte vectors could get an 8-bit mask from 8 ints at once.) Or compare 4 vectors and combine with 2x packssdw / 1x packsswb before pmovmskb to get one 16-bit mask result. But IDK what chunk size the vector<bool> specialization uses internally. (No luck with x86-64-v4 or GCC: godbolt.org/z/b4PKnWbzT)
vector<bool> is a terrible choice of name for it, though; that's the mistake. x86-64 SIMD (SSE2 and AVX2) can very naturally and efficiently generate bitmaps from SIMD compares, so in theory it's an efficient data structure. But in practice the C++ wrappers (both libstdc++ and libc++) are too complex for auto-vectorization to see through, it seems.
std::vector<bool> is fairly unpopular. You might have more luck auto-vectorizing a loop that works in chunks of 8, 16, or 32 elements, building up a mask in a uint32_t and doing one assignment through a pointer, instead of many separate RMWs with complicated indexing to break down a bit-index into a pointer and a shift count. IIRC, there have been some previous SO Q&As where people had some success at getting something to auto-vectorize to a pmovmskb or movmskps.
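A minimal sketch of the chunked approach suggested in the last comment, plus the pcmpeqd / movmskps step written with SSE2 intrinsics. The names compare_to_bitmap and compare4_sse2 are hypothetical, and so is the choice of a plain uint32_t buffer in place of vector<bool>. Whether a given compiler auto-vectorizes the inner loop is not guaranteed, so it is worth checking the output in Compiler Explorer.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: bit (i % 32) of dst[i / 32] is set iff lhs[i] == rhs[i].
// dst must have room for (len + 31) / 32 words; unused high bits of the
// last word are written as 0.
void compare_to_bitmap(const int* lhs, const int* rhs, std::size_t len,
                       std::uint32_t* dst) {
    std::size_t i = 0;
    std::size_t w = 0;
    // Full chunks: build the mask in a register, then do one plain store.
    for (; i + 32 <= len; i += 32, ++w) {
        std::uint32_t mask = 0;
        for (unsigned b = 0; b < 32; ++b) {
            mask |= static_cast<std::uint32_t>(lhs[i + b] == rhs[i + b]) << b;
        }
        dst[w] = mask;
    }
    // Tail: fewer than 32 elements left.
    if (i < len) {
        std::uint32_t mask = 0;
        for (unsigned b = 0; i + b < len; ++b) {
            mask |= static_cast<std::uint32_t>(lhs[i + b] == rhs[i + b]) << b;
        }
        dst[w] = mask;
    }
}

#if defined(__SSE2__)
// Manual version of one step: compare 4 ints (pcmpeqd) and extract the
// sign bit of each 32-bit lane (movmskps). Bit k of the result is set
// iff lhs[k] == rhs[k], for k in 0..3.
inline unsigned compare4_sse2(const int* lhs, const int* rhs) {
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(lhs));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(rhs));
    __m128i eq = _mm_cmpeq_epi32(a, b);
    return static_cast<unsigned>(_mm_movemask_ps(_mm_castsi128_ps(eq)));
}
#endif
```

Eight calls to compare4_sse2, each shifted left by a further 4 bits and ORed together, would fill one 32-bit word of the bitmap.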