
For speed, I would like to read one of 8 registers referenced by the value in a 9th register. The fastest way I see to do this is to use 3 conditional jumps (checking 3 bits in the 9th register). This should have shorter latency than the standard way of doing this with an offset memory read, but this still requires at least 6 clock cycles (at least one test plus one conditional jmp per bit check).
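The 3-bit-test idea can be sketched in C like this (a hypothetical illustration; `select_branchy` is an invented name, and a compiler may lower the ternaries to branches or cmovs as it sees fit):

```c
#include <stdint.h>

/* Pick one of 8 values by testing the index bits one at a time,
 * mirroring the "3 conditional jumps" approach described above. */
static inline uint64_t select_branchy(const uint64_t r[8], unsigned idx)
{
    if (idx & 4) {
        if (idx & 2)
            return (idx & 1) ? r[7] : r[6];
        return (idx & 1) ? r[5] : r[4];
    }
    if (idx & 2)
        return (idx & 1) ? r[3] : r[2];
    return (idx & 1) ? r[1] : r[0];
}
```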

Is there any commercial CPU (preferably x86/x64) with an intrinsic to do this "offset register read" with a latency of just one clock cycle?

In theory, an optimized CPU could do this with one addition and one move, so two or one clock cycles seems easy...is there some general reason that architectures don't care about speeding up an offset read for a small array?

  • Treating the CPU registers as an array is really not a common approach these days. The last architecture I know that allowed this was the PDP-11, and it died out in the late 80s. Why don't you put your array into some memory location like any other array? Commented Feb 3, 2020 at 14:57
  • Accessing a register by variable index more or less defeats register renaming, so it's a very unlikely feature for a fast processor. You can sort of index a YMM register using VPERMD; maybe that's still OK for your use case? Commented Feb 3, 2020 at 14:59
  • @fuz The AVR can do this too; the registers are mapped into the address space. Commented Feb 3, 2020 at 15:01
  • CPUs don't have intrinsics; compilers have intrinsics. Perhaps you mean "machine instruction" or "addressing mode" (which a compiler might have an intrinsic for)? As previous commenters say, most ISAs can't index the registers with a runtime index, only with indexes embedded in the machine code of an instruction. (So register fetch can happen any time after decode, without any extra indirection as part of the ISA.) Commented Feb 3, 2020 at 15:05
  • Self-modifying code is much more expensive than merely flushing a line from the data cache; on modern x86 it flushes the whole pipeline (perf counter event machine_clears.smc). Fine if you JIT once and run many times, total garbage if you do it every time. Store-forwarding latency is only about 3 to 5 cycles on modern Intel, similar to L1d load-use latency of 4 or 5 cycles. What's the larger problem you're trying to solve where you need low-latency array indexing? Can you solve it with AVX2 shuffles instead? That's 3c latency (plus more for a vmovd of an index from integer regs). Commented Feb 3, 2020 at 15:11

1 Answer


Treating the CPU registers as an array is really not a common approach these days. The last architecture I know that allowed this was the PDP11 and it died out in the late 80s. Why don't you put your array into some memory location like any other array?
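The memory-array approach suggested above can be sketched in C (a minimal illustration; `select_via_memory` is an invented name):

```c
#include <stdint.h>

/* Keep the eight values in a small array so the selection is a single
 * indexed load.  A hit in L1d has a load-use latency of roughly 4-5
 * cycles on modern x86, but it pipelines with surrounding work. */
static inline uint64_t select_via_memory(const uint64_t spill[8], unsigned idx)
{
    return spill[idx & 7];  /* one mov with an indexed addressing mode */
}
```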

That said, you could use a computed jump. This replaces a data dependency (the indexed addressing mode) with a control dependency, so out-of-order exec doesn't have to wait for the index input to be ready before it can start running code that uses the final RAX. Of course this assumes correct branch prediction, which is unlikely if the index changes often. A branch mispredict costs many cycles during which little work gets done, whereas the small latency of a load that hits in L1d cache overlaps easily with independent work.

The throughput cost is higher than an array in memory: some address computations, one jump, one move and a ret, instead of just a mov or even a memory operand with an indexed addressing mode.

To inline this code, simply replace the jmp *%rax with a call *%rax, costing another uop. Or replace the ret instructions with a jmp to a label at the bottom and increase the stride of the jump table to 8 to account for the longer encoding.

    # select a register from r8...r15 according to the value in rdi
select:
    lea labels-4*8(%rip),%rax # rdi = 8 is the first jump table entry
    lea (%rax,%rdi,4),%rax    # pointer to the appropriate entry
    jmp *%rax                 # computed jump

    .align 4
labels:
    mov %r8, %rax
    ret

    .align 4
    mov %r9, %rax
    ret

    .align 4
    mov %r10, %rax
    ret

    .align 4
    mov %r11, %rax
    ret

    .align 4
    mov %r12, %rax
    ret

    .align 4
    mov %r13, %rax
    ret

    .align 4
    mov %r14, %rax
    ret

    .align 4
    mov %r15, %rax
    ret

While this is probably faster than three conditional jumps (depending on access pattern), it surely won't beat just using an array.
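For comparison, the jump table above can be transliterated into C using GCC/Clang's "labels as values" extension, so a single indirect jump selects the value, just like `jmp *%rax` (a hypothetical sketch; `select_via_jump` is an invented name):

```c
#include <stdint.h>

/* One computed jump into a table of labels, each returning one value. */
static uint64_t select_via_jump(const uint64_t v[8], unsigned idx)
{
    static void *const table[8] = {
        &&l0, &&l1, &&l2, &&l3, &&l4, &&l5, &&l6, &&l7
    };
    goto *table[idx & 7];    /* computed jump, like jmp *%rax */
l0: return v[0];
l1: return v[1];
l2: return v[2];
l3: return v[3];
l4: return v[4];
l5: return v[5];
l6: return v[6];
l7: return v[7];
}
```

Note this relies on a GNU C extension and is not portable ISO C.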

You may also be able to use code like this, assuming the index is in eax. This works by copying the index bits into CF, SF, and PF and then using a bunch of ALU operations to distinguish them:

    imul $0x4100, %eax, %eax  # place index bits 0, 1, 2 at the CF, SF, PF positions of AH
    sahf                      # load AH into the flags: CF = bit 0, SF = bit 1, PF = bit 2

    # bit 0
    mov %r8, %rax
    cmovc %r9, %rax
    mov %r10, %rcx
    cmovc %r11, %rcx
    mov %r12, %rdx
    cmovc %r13, %rdx
    mov %r14, %rbx
    cmovc %r15, %rbx

    # bit 1
    cmovs %rcx, %rax
    cmovs %rbx, %rdx

    # bit 2
    cmovp %rdx, %rax

The result ends up in %rax. Due to the high instruction-level parallelism and lack of branches in this code, it should perform better than the jump table above, unless the index is almost always the same.

(stolen from this answer).
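A portable C sketch of the branchless cmov cascade above: three levels of 2-to-1 selects driven by the index bits, which compilers commonly lower to cmov (`select_branchless` is an invented name):

```c
#include <stdint.h>

/* Three-level binary multiplexer on the index bits, no branches. */
static inline uint64_t select_branchless(const uint64_t v[8], unsigned i)
{
    /* bit 0: pick the odd or even element of each pair */
    uint64_t a = (i & 1) ? v[1] : v[0];
    uint64_t b = (i & 1) ? v[3] : v[2];
    uint64_t c = (i & 1) ? v[5] : v[4];
    uint64_t d = (i & 1) ? v[7] : v[6];
    /* bit 1: pick between the two surviving pairs */
    uint64_t ab = (i & 2) ? b : a;
    uint64_t cd = (i & 2) ? d : c;
    /* bit 2: final select */
    return (i & 4) ? cd : ab;
}
```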


18 Comments

Good idea. I wish there were a C++ way which generated assembly like this. Also, I can't believe (yet?) that this can be slower than L1 cache for random access.
@bobuhito Agner's tables say so for Skylake X. Your mileage may vary. If you told us what algorithm you try to implement using this approach, we might be able to propose even better solutions that break out of the “using registers as arrays” idea.
@bobuhito: yes, L1d load-use latency is 4 cycles (best case) on SnB-family. Or 5 cycles outside of the pointer-chasing special case for simple addressing modes where the value came from another load. But the core doesn't stall for those 5 cycles!! Almost always there's some independent work to do. The throughput of loads is 2 per cycle (on modern x86, AMD and Intel).
@bobuhito Hm... I'd use a state machine for that. What do you plan to do with the other values that do not increment? Clearly, this is not the whole picture. Perhaps post some pseudo code of the entire operation you want to implement (and not just the tiny part you want to use indirect addressing for).
@bobuhito Your problem description, too, is a bit too abstract for me to work with. Best would be if you could add a self-contained piece of code to your question that demonstrates what you are trying to achieve. I can then try and rewrite it into fast assembly. If all you want to do is perform some simple calculations, consider whether you actually need to perform all 256 iterations or whether you can use some sum formula instead.
