
I have the following logic:

struct Range {
  int start;
  int end;
};

bool prev = false;
Range range;
std::vector<Range> result;
for (int i = 0; i < n; i++) {
  bool curr = ...; // this is random and causes prediction misses
  if (curr && !prev)
    range.start = i;
  if (!curr && prev){
    range.end = i-1;
    result.push_back(range);
  }
  prev = curr;
}
// handles edge case at the end

How can I remove branching in this code? The branching pattern is like run-length encoding, but we're storing the start/end of runs of true.¹

I tried storing the boolean results as 1s and 0s so that std::adjacent_difference could give me an array of ones and zeros, where a one corresponds to a start and a zero corresponds to an end, but how do I construct results from here?

Footnote 1: classic RLE would store narrow (e.g. 8-bit) lengths for both false and true, so position is implicit. But also an explicit value so long runs could be encoded as multiple runs of 255, or 127 if we used 1-byte entries with one bit holding the true/false value and the other 7 holding a length.

  • Just FYI, this code as-is won't do what appears to be intended either: if i - 1 results in curr = true, the last range will not be added to the result vector. Commented Sep 18, 2023 at 23:40
  • Always a good idea to have a little blurb explaining what you really want the code to do, in case there's a cheap hack you totally overlooked that gets the job done. Commented Sep 18, 2023 at 23:53
  • If the data is really random, then this isn't really optimizable. If the data tends to have runs, then it is. Commented Sep 18, 2023 at 23:55
  • You should consider reserving capacity for the vector, either with a reasonable guess or a blatant over-estimation (such as n). Worrying about branch prediction seems a bit strange when you have reallocations happening in the loop. Commented Sep 18, 2023 at 23:58
  • What I'm talking about is the possibility of something above this algorithm that could be done better, possibly avoiding the algorithm altogether. Commented Sep 19, 2023 at 0:05

3 Answers


If the data is really random, then this isn't really optimizable. If the data tends to have runs, then it might be optimized like so:

result.reserve(lots);
int start = 0;
int i = 0;
while (true) { // this loop is unconditional: no branch
  while (i < n && !getCur(i)) // find next start; the only branch miss should be when it finds a start
    i++;
  if (i >= n) break; // no more runs
  start = i;
  while (i < n && getCur(i)) // find next end; the only branch miss should be when it finds an end
    i++;
  result.push_back({start, i - 1}); // also records a run that extends to n-1
}

https://godbolt.org/z/qrevovzna Both Clang and GCC implemented each inner loop with 7-8 instructions. Each loop has two branches (one for i < n and one for the result of getCur, to either continue or exit the loop), but if your data has runs, those branches will have a very high prediction rate.

This variant may incur function call overhead, depending on the complexity of getCur. If getCur is inlined, the two inner loops may require ~2x as much space in the instruction cache, so you might not get the expected gains.


I tried storing the boolean results as 1s and 0s so that std::adjacent_difference

That just moves the branching elsewhere, and doesn't actually improve anything.
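To make that concrete, here is a sketch of what constructing ranges from the std::adjacent_difference output would look like (a hypothetical runs_from_diff helper): a +1 marks a run start at i and a -1 marks a run end at i-1, but extracting them still needs per-element conditions, so the branches reappear in the construction pass.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: rebuild [start, end] runs of 1s from a 0/1 sequence
// via std::adjacent_difference. In the difference array, +1 marks a run
// start at i and -1 marks a run end at i-1.
std::vector<std::pair<int, int>> runs_from_diff(const std::vector<int>& bits) {
    std::vector<int> diff(bits.size());
    std::adjacent_difference(bits.begin(), bits.end(), diff.begin());
    std::vector<std::pair<int, int>> result;
    int start = 0;
    for (int i = 0; i < (int)diff.size(); ++i) {
        if (diff[i] == 1) start = i;                          // still a branch
        if (diff[i] == -1) result.push_back({start, i - 1});  // still a branch
    }
    if (!bits.empty() && bits.back() == 1)                    // trailing run of 1s
        result.push_back({start, (int)bits.size() - 1});
    return result;
}
```

The two ifs inside the loop are exactly the curr && !prev / !curr && prev tests from the original code, just applied to the difference array, so nothing is gained on the branch-prediction front.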



A technique you can use is to unconditionally store a Range to the result in every iteration, but only conditionally increment the index at which you store it. That conditional increment can be performed with a conditional move or plain arithmetic, unlike a conditional push_back (there is no guarantee the compiler will do so, but it is at least possible; check the resulting assembly code). Ranges that wouldn't be stored in the original code are now stored but overwritten in the next iteration. Ranges that would be stored are kept by incrementing the index so that the next store doesn't overwrite them.

In order to do this, the vector needs sufficient size up front, otherwise the writes are out of bounds. It doesn't need to be quite as large as n (at most ⌈n/2⌉ ranges can exist, since runs alternate), but n is a safe bound. You can shrink it to the actual size at the end.

For example,

#include <vector>
#include <stddef.h>

struct Range {
  int start;
  int end;
};

extern bool get_bool();

std::vector<Range> do_thing(size_t n)
{
    bool prev = false;
    Range range;
    std::vector<Range> result;
    result.resize(n);
    size_t index = 0;
    for (size_t i = 0; i < n; i++) {
        bool curr = get_bool();
        int mask = -(curr && !prev);
        range.start ^= (range.start ^ i) & mask;
        //range.start = (range.start & ~mask) | (i & mask); // alternative
        range.end = i-1;
        result[index] = range; // always store a range
        index += !curr && prev;
        prev = curr;
    }
    result.resize(index);
    return result;
}

But is it branchless? Apparently yes. Clang is a bit more forgiving in this respect than GCC: it also produced branchless code without the manual arithmetic tricks, but the manual trickery was needed for GCC.

This also may not be more efficient, depending on how predictable the branches that are avoided this way were.

Clang 15 apparently turned the conditional update of the start into this:

    test    al, al
    mov     ecx, ebp
    cmovne  ecx, ebx
    lea     edx, [rbx - 1]  ; the result of this lea is used later, not relevant for this snippet
    test    r13b, r13b
    cmove   ebp, ecx

So using conditional moves instead of branches.

And the conditional increment of the index into:

    setne   cl
    mov     edx, eax
    xor     dl, 1
    and     dl, cl
    movzx   ecx, dl
    add     r12, rcx

Depending on where the values of curr come from it may even be possible to use SIMD for this. Here's a sketch:

The start indices and end indices can be filtered and "compressed" separately (keeping only the indices for which curr && !prev or !curr && prev, respectively); then they naturally "pair up". The locations with a start or end can be found by simple mask manipulation. Use vpcompressd (with AVX-512), a suitable emulation based on pshufb (pre-AVX512), or whatever your SIMD ISA of choice exposes to "compress" (aka left-pack) the indices, do an unaligned store, and advance the destination pointer by std::popcount(mask).

6 Comments

Per How can I perform a branchless conditional arithmetic operation in C? - if(...) index++ could be written as index += !curr && prev, or perhaps index = (!curr && prev) ? index+1 : index to perhaps encourage LEA / CMOV instead of xor-zero + SETCC + ADD? But for critical path latency, SETCC + ADD is better.
"left-pack" is common terminology for what vpcompressd does, as in AVX2 what is the most efficient way to pack left based on a mask? .
If you don't need to be totally branchless, you do less branching by using your SIMD idea but without the compress. From a few vectors of input data, vpmovmskb or vmovmskps to get bitmaps of curr, and do changemask = curr ^ (curr>>1) or something to get a mask of where it changes (where prev != curr). Then iterate over the set bits in that, with mask &= mask - 1 (x86 blsi), using pos = chunk_base + std::countr_zero(mask). The loop branch would probably mispredict once per chunk, when it exits, but that's better than mispredicting every input element.
@ra1nsq I wouldn't do that, it's in a loop carried dependency and there would be a lot of latency that way. But I'll add some alternative.
@ra1nsq if you're interested in the SIMD approach, I can work it out more if you want. For that I'd like to know where the booleans come from, realistically it needs to be SIMDified too in order to be worth doing. I'm not opposed to doing these arithmetic tricks but SIMD can probably slay it in terms of performance.

Not sure if this is branchless the way you want, but it would eliminate the need for using any ifs while iterating:

struct Range {
  int start;
  int end;
};

std::vector<Range> result;
if (n > 0) {
  result.reserve(n);
  Range range{0,0};

  // using std::function with capturing lambdas would be cleaner,
  // but since the purpose of this exercise seems to be about
  // optimizing, using plain function pointers instead...
  using vec = std::vector<Range>;
  using func = void(*)(vec&, Range&, int);
  func actions[2] = {
    +[](vec&, Range &r, int) {},
    +[](vec &v, Range &r, int curr) {
      r.end = curr - 1;
      v.push_back(r);
      r.start = curr;
    }
  };

  bool curr = ...; // get 1st value
  for (int i = 1; i < n; ++i) {
    bool val = ...; // get next value
    actions[val != curr](result, range, i);
    curr = val;
  }

  range.end = n - 1;
  result.push_back(range);
}

// use result as needed...


But really, that's just an ugly way to avoid using 1 if in the loop, which honestly probably isn't worth trying to avoid, IMHO:

struct Range {
  int start;
  int end;
};

std::vector<Range> result;
if (n > 0) {
  result.reserve(n);
  Range range{0,0};

  bool curr = ...; // get 1st value
  for (int i = 1; i < n; ++i) {
    bool val = ...; // get next value
    if (val != curr) {
      range.end = i - 1;
      result.push_back(range);
      range.start = i;
    }
    curr = val;
  }

  range.end = n - 1;
  result.push_back(range);
}


Don't over-complicate your code if you don't have to.

1 Comment

that's just an ugly way to avoid using 1 if in the loop, - indeed, if the compiler doesn't see through this array of function-pointers and compile it the same as an if, an indirect function call will typically perform worse than a conditional branch. Especially if there's extra load-use latency from L1d cache before it can detect mispredicts. (e.g. about 5 cycles from having an index to having a function pointer loaded on modern x86, making branch mispredicts cost that much extra wasted work on top of the typical 15 to 20 cycle recovery time.)
