A technique you can use is to unconditionally store a Range to the result in every iteration, but only conditionally increment the index at which you store it. That conditional increment can be performed with a conditional move or plain arithmetic (of course there is no guarantee that it will be, but it's at least possible, check the resulting assembly code), unlike a conditional push_back. Ranges that wouldn't be stored in the original code, are now stored but overwritten in the next iteration. Ranges that would be stored, are kept by incrementing the index so that the next store doesn't overwrite it.
In order to do this, the vector needs sufficient size first (it doesn't need to be quite as high as n I think, but you can figure out those details), otherwise the writes are out of bounds. You can set the size to the actual size in the end.
For example,
#include <vector>
#include <stddef.h>
struct Range {
int start;
int end;
};
extern bool get_bool();
std::vector<Range> do_thing(size_t n)
{
bool prev = false;
Range range;
std::vector<Range> result;
result.resize(n);
size_t index = 0;
for (size_t i = 0; i < n; i++) {
bool curr = get_bool();
int mask = -(curr && !prev);
range.start ^= (range.start ^ i) & mask;
//range.start = (range.start & ~mask) | (i & mask); // alternative
range.end = i-1;
result[index] = range; // always store a range
index += !curr && prev;
prev = curr;
}
result.resize(index);
return result;
}
But is it branchless?. Apparently yes. Clang is a bit more forgiving in this respect than GCC and also made branchless code if we don't use manual arithmetic tricks but the manual trickery was needed for GCC.
This also may not be more efficient, depending on how predictable the branches that are avoided this way were.
Clang 15 apparently turned the conditional update of the start into this:
test al, al
mov ecx, ebp
cmovne ecx, ebx
lea edx, [rbx - 1] ; the result of this lea is used later, not relevant for this snippet
test r13b, r13b
cmove ebp, ecx
So using conditional moves instead of branches.
And the conditional increment of the index into:
setne cl
mov edx, eax
xor dl, 1
and dl, cl
movzx ecx, dl
add r12, rcx
Depending on where the values of curr come from it may even be possible to use SIMD for this. Here's a sketch:
The start indices and end indices can be filtered out separately and "compressed" (keeping only the indices for which curr && !prev or !curr && prev respectively) separately, then they will naturally "pair up". The locations with a start or end could be found by simple mask manipulation, use vpcompressd (with AVX512) or suitable emulation based on pshufb (pre-AVX512) or whatever your SIMD ISA of choice exposes to "compress" (aka left pack) the indices, do an unaligned store, and increase the destination pointer by std::popcount(mask).
i - 1results incurr = true, the last range will not be added to theresultvector.n). Worrying about branch prediction seems a bit strange when you have reallocations happening in the loop.