
I have the following logic:

struct Range {
  int start;
  int end;
};

bool prev = false;
Range range;
std::vector<Range> result;
for (int i = 0; i < n; i++) {
  bool curr = ...; // this is random and causes prediction misses
  if (curr && !prev)
    range.start = i;
  if (!curr && prev){
    range.end = i-1;
    result.push_back(range);
  }
  prev = curr;
}
// handles edge case at the end

How can I remove branching in this code? The branching pattern is like run-length encoding, but we're storing the start/end of runs of true.¹

I tried storing the boolean results as 1s and 0s so that std::adjacent_difference could give me an array of ones and zeros, where a one corresponds to a start and a zero corresponds to an end, but how do I construct results from here?

Footnote 1: classic RLE would store narrow (e.g. 8-bit) lengths for both false and true, so position is implicit. But also an explicit value so long runs could be encoded as multiple runs of 255, or 127 if we used 1-byte entries with one bit holding the true/false value and the other 7 holding a length.

  • Just FYI, this code as-is won't do what appears to be intended either: if i - 1 results in curr = true, the last range will not be added to the result vector. Commented Sep 18, 2023 at 23:40
  • Always a good idea to have a little blurb explaining what you really want the code to do, in case there's a cheap hack you totally overlooked that gets the job done. Commented Sep 18, 2023 at 23:53
  • If the data is really random, then this isn't really optimizable. If the data tends to have runs, then it is. Commented Sep 18, 2023 at 23:55
  • You should consider reserving capacity for the vector, either with a reasonable guess or a blatant over-estimation (such as n). Worrying about branch prediction seems a bit strange when you have reallocations happening in the loop. Commented Sep 18, 2023 at 23:58
  • What I'm talking about is the possibility of something above this algorithm that could be done better, possibly avoiding the algorithm altogether. Commented Sep 19, 2023 at 0:05

3 Answers


If the data is really random, then this isn't really optimizable. If the data tends to have runs, then it might be optimized like so:

result.reserve(lots);
int start = 0;
int i = 0;
while (true) { // this loop is unconditional: no branch
  while (i < n && !getCur(i)) // find next start; the only branch miss should be when it finds a start
    i++;
  if (i >= n) break; // no more runs
  start = i;
  while (i < n && getCur(i)) // find next end; the only branch miss should be when it finds an end
    i++;
  result.push_back({start, i - 1}); // also records a run that extends to n-1
}

https://godbolt.org/z/qrevovzna Both Clang and GCC implemented each inner loop with 7-8 instructions. Each loop has two branches (one for i < n and one for the result of getCur, to either continue or exit the loop), but if your data has runs, those branches will have a very high prediction rate.

This variant may incur function call overhead, depending on the complexity of getCur. If getCur is inlined, the two inner loops may require ~2x as much space in the instruction cache, so you might not get the expected gains.


I tried storing the boolean results as 1s and 0s so that std::adjacent_difference

That just moves the branching elsewhere, and doesn't actually improve anything.
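To make that concrete, here is a sketch of what constructing ranges from the std::adjacent_difference output would look like (a hypothetical runs_from_diff helper): a +1 marks a run start at i and a -1 marks a run end at i-1, but extracting them still needs per-element conditions, so the branches reappear in the construction pass.

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: rebuild [start, end] runs of 1s from a 0/1 sequence
// via std::adjacent_difference. In the difference array, +1 marks a run
// start at i and -1 marks a run end at i-1.
std::vector<std::pair<int, int>> runs_from_diff(const std::vector<int>& bits) {
    std::vector<int> diff(bits.size());
    std::adjacent_difference(bits.begin(), bits.end(), diff.begin());
    std::vector<std::pair<int, int>> result;
    int start = 0;
    for (int i = 0; i < (int)diff.size(); ++i) {
        if (diff[i] == 1) start = i;                          // still a branch
        if (diff[i] == -1) result.push_back({start, i - 1});  // still a branch
    }
    if (!bits.empty() && bits.back() == 1)                    // trailing run of 1s
        result.push_back({start, (int)bits.size() - 1});
    return result;
}
```

The two ifs inside the loop are exactly the curr && !prev / !curr && prev tests from the original code, just applied to the difference array, so nothing is gained on the branch-prediction front.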



A technique you can use is to unconditionally store a Range to the result in every iteration, but only conditionally increment the index at which you store it. That conditional increment can be performed with a conditional move or plain arithmetic, unlike a conditional push_back (there is no guarantee the compiler will do so, but it is at least possible; check the resulting assembly code). Ranges that wouldn't be stored in the original code are now stored but overwritten in the next iteration. Ranges that would be stored are kept by incrementing the index so that the next store doesn't overwrite them.

In order to do this, the vector needs sufficient size up front, otherwise the writes are out of bounds. It doesn't need to be quite as large as n (at most ⌈n/2⌉ ranges can exist, since runs alternate), but n is a safe bound. You can shrink it to the actual size at the end.

For example,

#include <vector>
#include <stddef.h>

struct Range {
  int start;
  int end;
};

extern bool get_bool();

std::vector<Range> do_thing(size_t n)
{
    bool prev = false;
    Range range;
    std::vector<Range> result;
    result.resize(n);
    size_t index = 0;
    for (size_t i = 0; i < n; i++) {
        bool curr = get_bool();
        int mask = -(curr && !prev);
        range.start ^= (range.start ^ i) & mask;
        //range.start = (range.start & ~mask) | (i & mask); // alternative
        range.end = i-1;
        result[index] = range; // always store a range
        index += !curr && prev;
        prev = curr;
    }
    result.resize(index);
    return result;
}

But is it branchless? Apparently yes. Clang is a bit more forgiving in this respect than GCC: it also produced branchless code without the manual arithmetic tricks, but the manual trickery was needed for GCC.

This also may not be more efficient, depending on how predictable the branches that are avoided this way were.

Clang 15 apparently turned the conditional update of the start into this:

    test    al, al
    mov     ecx, ebp
    cmovne  ecx, ebx
    lea     edx, [rbx - 1]  ; the result of this lea is used later, not relevant for this snippet
    test    r13b, r13b
    cmove   ebp, ecx

So using conditional moves instead of branches.

And the conditional increment of the index into:

    setne   cl
    mov     edx, eax
    xor     dl, 1
    and     dl, cl
    movzx   ecx, dl
    add     r12, rcx

Depending on where the values of curr come from it may even be possible to use SIMD for this. Here's a sketch:

The start indices and end indices can be filtered and "compressed" separately (keeping only the indices for which curr && !prev or !curr && prev, respectively); then they naturally "pair up". The locations with a start or end can be found by simple mask manipulation. Use vpcompressd (with AVX-512), a suitable emulation based on pshufb (pre-AVX512), or whatever your SIMD ISA of choice exposes to "compress" (aka left-pack) the indices, do an unaligned store, and advance the destination pointer by std::popcount(mask).

6 Comments

Per How can I perform a branchless conditional arithmetic operation in C? - if(...) index++ could be written as index += !curr && prev, or perhaps index = (!curr && prev) ? index+1 : index to perhaps encourage LEA / CMOV instead of xor-zero + SETCC + ADD? But for critical path latency, SETCC + ADD is better.
"left-pack" is common terminology for what vpcompressd does, as in AVX2 what is the most efficient way to pack left based on a mask? .
If you don't need to be totally branchless, you do less branching by using your SIMD idea but without the compress. From a few vectors of input data, vpmovmskb or vmovmskps to get bitmaps of curr, and do changemask = curr ^ (curr>>1) or something to get a mask of where it changes (where prev != curr). Then iterate over the set bits in that, with mask &= mask - 1 (x86 blsi), using pos = chunk_base + std::countr_zero(mask). The loop branch would probably mispredict once per chunk, when it exits, but that's better than mispredicting every input element.
@ra1nsq I wouldn't do that, it's in a loop carried dependency and there would be a lot of latency that way. But I'll add some alternative.
@ra1nsq if you're interested in the SIMD approach, I can work it out more if you want. For that I'd like to know where the booleans come from, realistically it needs to be SIMDified too in order to be worth doing. I'm not opposed to doing these arithmetic tricks but SIMD can probably slay it in terms of performance.

Not sure if this is branchless the way you want, but it would eliminate the need for using any ifs while iterating:

struct Range {
  int start;
  int end;
};

std::vector<Range> result;
if (n > 0) {
  result.reserve(n);
  Range range{0,0};

  // using std::function with capturing lambdas would be cleaner,
  // but since the purpose of this exercise seems to be about
  // optimizing, using plain function pointers instead...
  using vec = std::vector<Range>;
  using func = void(*)(vec&, Range&, int);
  func actions[2] = {
    +[](vec&, Range &r, int) {},
    +[](vec &v, Range &r, int curr) {
      r.end = curr - 1;
      v.push_back(r);
      r.start = curr;
    }
  };

  bool curr = ...; // get 1st value
  for (int i = 1; i < n; ++i) {
    bool val = ...; // get next value
    actions[val != curr](result, range, i);
    curr = val;
  }

  range.end = n - 1;
  result.push_back(range);
}

// use result as needed...


But really, that's just an ugly way to avoid using 1 if in the loop, which honestly probably isn't worth trying to avoid, IMHO:

struct Range {
  int start;
  int end;
};

std::vector<Range> result;
if (n > 0) {
  result.reserve(n);
  Range range{0,0};

  bool curr = ...; // get 1st value
  for (int i = 1; i < n; ++i) {
    bool val = ...; // get next value
    if (val != curr) {
      range.end = i - 1;
      result.push_back(range);
      range.start = i;
    }
    curr = val;
  }

  range.end = n - 1;
  result.push_back(range);
}


Don't over-complicate your code if you don't have to.

1 Comment

that's just an ugly way to avoid using 1 if in the loop, - indeed, if the compiler doesn't see through this array of function-pointers and compile it the same as an if, an indirect function call will typically perform worse than a conditional branch. Especially if there's extra load-use latency from L1d cache before it can detect mispredicts. (e.g. about 5 cycles from having an index to having a function pointer loaded on modern x86, making branch mispredicts cost that much extra wasted work on top of the typical 15 to 20 cycle recovery time.)
