EDIT x 2
- Added a more comprehensive function which returns an abstract register class: the function outputs a register full of floats. I don't care about the actual length (SSE, AVX, ...) because Google Highway will figure that out for me.
- Working code available on Godbolt
- `test1`, `test2`, `test3` could also be masks
EDIT
- Returned values are not known at compile time (these trivial values are there just to make the branching immediate to understand)
- I fixed the branching that was so immediate to understand that I got it wrong (:
I'm new to SIMD and I'm using Google Highway to achieve a portable (x86 and ARM) solution, so I'm writing this question in general terms.
I'm trying to speed up this C/C++ code with SIMD instructions:

    const bool test1 = foo(input1) > 0; // unpredictable
    const bool test2 = foo(input2) > 0; // unpredictable
    const bool test3 = foo(input3) > 0; // unpredictable
    const RegisterWithFourFloats out0;   // not known at compile time
    const RegisterWithFourFloats out1;   // not known at compile time
    const RegisterWithFourFloats out12;  // not known at compile time
    const RegisterWithFourFloats out13;  // not known at compile time
    const RegisterWithFourFloats out123; // not known at compile time
    if (test1)
        if (test2)
            if (test3)
                return out123;
            else
                return out12;
        else
            if (test3)
                return out13;
            else
                return out1;
    else
        return out0;

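A minimal sketch of how the cascade above could map onto per-lane selects, assuming the three tests are per-lane masks rather than scalar bools. `Vec4`, `Mask4`, `Select` and `Cascade` are hypothetical names introduced for illustration, and the loop is a scalar model of a vector blend, not tuned SIMD code:

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-ins for a 4-lane float register and its comparison mask.
using Vec4  = std::array<float, 4>;
using Mask4 = std::array<bool, 4>;

// Per-lane select: lane i takes yes[i] where m[i] is true, otherwise no[i].
// This models what a vector blend instruction does.
inline Vec4 Select(const Mask4& m, const Vec4& yes, const Vec4& no) {
    Vec4 r{};
    for (std::size_t i = 0; i < r.size(); ++i) r[i] = m[i] ? yes[i] : no[i];
    return r;
}

// The if/else tree rewritten inside-out as four selects.
inline Vec4 Cascade(const Mask4& t1, const Mask4& t2, const Mask4& t3,
                    const Vec4& out0, const Vec4& out1, const Vec4& out12,
                    const Vec4& out13, const Vec4& out123) {
    const Vec4 t2_true  = Select(t3, out123, out12);  // test1 && test2
    const Vec4 t2_false = Select(t3, out13, out1);    // test1 && !test2
    const Vec4 t1_true  = Select(t2, t2_true, t2_false);
    return Select(t1, t1_true, out0);                 // !test1 always yields out0
}
```

Every lane evaluates all four selects regardless of the mask values, so there is nothing for the branch predictor to miss. Whether that beats the branchy version depends on how expensive the five candidate vectors are to produce.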
So the keys are
| test 1 | test 2 | test 3 | returned value | mask name |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | mask1 |
| 1 | 1 | 1 | 123 | mask2 |
| 1 | 0 | 1 | 13 | mask3 |
| 1 | 1 | 0 | 12 | mask4 |
| 1 | 0 | 0 | 1 | mask5 |
I hope this question is clear and I'm very happy to improve it. I guess the solution should be a general SIMD strategy that I can test.
I have tried two solutions that return the correct output but are slower:
- Flatten the IF/ELSE cascade by removing nested IF/ELSE. This creates more jump instructions and therefore gives poor performance.
- Go branch-free with the `IfThenElse` function, which basically creates AND/OR masks:

    Vec result = [test1, test2, test3]
    output = IfThenElse( Xor( mask1, result), 0, output)
    output = IfThenElse( Xor( mask2, result), 123, output)
    output = IfThenElse( Xor( mask3, result), 13, output)
    output = IfThenElse( Xor( mask4, result), 12, output)
    output = IfThenElse( Xor( mask5, result), 1, output)
    return output;

I took a bit of a shortcut here to keep the question concise, but the idea of the Xor operator is that it results in a mask with true values IFF maskX is block-wise equal to result. Therefore, the value of output is updated only when maskX is block-wise equal to result. The result is correct, but the runtime cost is higher.
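To make the compare-and-overwrite idea concrete, here is a scalar model of it. The names `Key4`, `Row` and `SelectByKey`, the bit layout of the key, and the `fallback` argument are all assumptions made for this sketch; they are not taken from the Godbolt code:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<float, 4>;
// Per-lane key (assumed layout): bit 0 = test1, bit 1 = test2, bit 2 = test3.
using Key4 = std::array<std::uint8_t, 4>;

// One row of the table: the key pattern to match and the vector returned for it.
struct Row {
    std::uint8_t key;
    Vec4 value;
};

// For each row, overwrite the lanes whose key equals the row's key.
// key ^ row.key is zero exactly when the two match, which is the comparison
// the Xor stands in for. Lanes matching no row keep the fallback value.
inline Vec4 SelectByKey(const Key4& key, const std::array<Row, 5>& rows,
                        const Vec4& fallback) {
    Vec4 out = fallback;
    for (const Row& row : rows) {
        for (std::size_t i = 0; i < out.size(); ++i) {
            if ((key[i] ^ row.key) == 0) out[i] = row.value[i];
        }
    }
    return out;
}
```

Each row costs one comparison plus one select per vector, so five rows mean roughly ten vector operations, which may be part of why this version measured slower. Note also that the table lists no row for keys where test1 is 0 but test2 or test3 is set; the sketch assumes the fallback is `out0` for those lanes, matching the if/else code.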
… `test` values (i.e. 0 or 1) by their value (1, 2, or 3) and sum the results?
… `_mm_and_si128` (bitwise AND with a mask whose elements are either 0 or -1). SIMD compares produce masks precisely to enable this kind of use. Or AVX-512 produces bitmasks but allows using them to mask vector ops. But anyway no, that doesn't work here. `1+2+3` = 6, not `123`. And since we need `13` or `12` instead of `103` or `120`, we can't just use `100`, `20` and `3` as the values.
… `pshufb` or ARM `tbl` available, in this case instead of masking and blending many different ways, you could combine three mask vectors like `mask1 | (mask2 << 1) | (mask3 << 2)`. If they're normal masks (all-one / all-zero bits) then maybe `pabsb` absolute value the result of combining them, and combine with addition instead of OR, since you need the high bit clear. Then use `pshufb` as a lookup table for 16 bytes in parallel, from a table of 16 bytes. The high bytes of each 32-bit element get zeroed since your `000` key maps to `0`.
… `if(test2){}else{}` are the same. I assume the table is correct.
… `test3` before `test2`. Are the actual return values fixed (known at compile time or outside the critical loop)?
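The lookup-table idea from the comments can be sketched as a scalar model of a 16-byte shuffle. This assumes the return values are small constants that fit in a byte (as in the table), which the question's edit says is not the real case, and that each mask byte already holds 0 or 1 rather than all-ones. `Bytes16`, `CombineKeys`, `Lookup16` and `kTable` are hypothetical names:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<std::uint8_t, 16>;

// Table indexed by key = test1 | (test2 << 1) | (test3 << 2).
// Even keys (test1 == 0) map to 0; unused upper entries stay 0.
const Bytes16 kTable{0, 1, 0, 12, 0, 13, 0, 123, 0, 0, 0, 0, 0, 0, 0, 0};

// Combine three 0/1 byte masks into one key per byte.
inline Bytes16 CombineKeys(const Bytes16& m1, const Bytes16& m2, const Bytes16& m3) {
    Bytes16 key{};
    for (std::size_t i = 0; i < key.size(); ++i) {
        key[i] = static_cast<std::uint8_t>(m1[i] | (m2[i] << 1) | (m3[i] << 2));
    }
    return key;
}

// Scalar model of a byte shuffle used as a lookup: out[i] = table[idx[i] & 0x0F].
// Keys here stay in 0..7, so the high-bit zeroing behavior of pshufb never triggers.
inline Bytes16 Lookup16(const Bytes16& table, const Bytes16& idx) {
    Bytes16 out{};
    for (std::size_t i = 0; i < out.size(); ++i) out[i] = table[idx[i] & 0x0F];
    return out;
}
```

With runtime vectors as return values, a byte table like this would not apply directly; the key-combining step is the part that carries over.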