Context
I'm trying to write a piece of code in inline assembly, which processes all elements of a "small" array (say ~10 elements) as a fully-unrolled loop. I want to avoid falling into the usual trap of declaring it as asm volatile with a generic "memory" clobber.
Right now the code in question is for the ARMv7-M architecture, targeting the Cortex-M4 core, but I often need to do something similar for AArch64.
I have already read the following Stack Overflow questions and answers:
How can I indicate that the memory pointed to by an inline ASM argument may be used?
Looping over arrays with inline assembly
Why inline assembly
I know inline assembly is not always the best solution to the problem. For the cases where I envision using it, I believe it may be the best choice:
- It is extremely performance critical;
- The compiler is doing an awful job of register allocation;
- The execution time of these assembly-language routines is on the order of ~100 cycles, so the cost of saving and restoring registers to adhere to the calling convention (if written directly in assembly, in a separate .s file) represents a significant portion of the execution time;
- The code is used in a handful of places in the rest of my project, so I'm willing to pay the cost in code size to inline it in every one of these places;
- I don't have the resources to rewrite the functions that call into this code in assembly (if I did, I could just directly insert the assembly code and avoid paying for the cost of saving and restoring registers.) I also frankly think that, if I did, it would be bad engineering practice to do so. They're much more readable and easier to maintain as C functions.
The problem
These are the requirements I have, and so I'm looking for a way to set up my inline assembly constraints so as to meet all of them.
- My code is very near the limit of using all 14 available registers in the Cortex-M4 (while ARMv7-M has 16 registers, the SP and PC are of course reserved). I cannot afford to reserve registers that won't actually be used.
- My code accesses each element of the array using a base + offset addressing mode, e.g. instructions like ldr r0, [r1, #16].
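To make the base + offset requirement concrete, here is a minimal sketch of how an element index maps to the immediate in such an instruction. The helper names elem_offset and load_at are hypothetical, and the 4-byte int is an assumption (it holds for the usual Cortex-M4 ABI), under which ldr r0, [r1, #16] would load element 4 of an int array whose base address is in r1.

```c
#include <stddef.h>

/* Hypothetical helper: byte offset of element i from the array base,
   i.e. the immediate in `ldr rX, [rBase, #offset]`. */
static size_t elem_offset(size_t i)
{
    return i * sizeof(int);
}

/* Hypothetical helper: read element i through an explicit
   base + byte-offset address, which is what the load computes. */
static int load_at(const int *base, size_t i)
{
    const unsigned char *p = (const unsigned char *)base + elem_offset(i);
    return *(const int *)p;
}
```

For a ten-element array the offsets run from 0 to 9 * sizeof(int), so every access needs only the base pointer in a register plus a small constant.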
What I've tried
For the sake of example, the code will be used to implement a function with the following prototype:
void f(int out[10], const int in[10]);
Currently my code is written as such:
asm volatile(
"my assembly code block"
: [out] "=r" (out), /* possibly other output constraints */
: [in] "r" (in), /* possibly other input constraints */
: "memory"
);
Thus, I'm using the dreaded asm volatile + "memory" clobber.
Following the suggestion in this answer, I tried to inform gcc of the actual addresses being accessed, with the aim of removing both the volatile from asm and the "memory" clobber. I rewrote it as such:
asm volatile(
"my assembly code block"
: [out] "=r" (out), "=m" (*(int(*)[10])out) /* possibly other output constraints */
: [in] "r" (in), "m" (*(const int(*)[10])in) /* possibly other input constraints */
);
However, gcc complains with errors such as this:
/.../file1:282:1: error: unable to find a register to spill
282 | }
| ^
/.../file1:282:1: error: this is the insn:
(insn 6985 6986 6983 5 (set (reg:SI 1979 [ t ])
(reg:SI 5177 [orig:1979 t ] [1979])) "/.../file2.c":83:5 759 {*thumb2_movsi_vfp}
(expr_list:REG_DEAD (reg:SI 5177 [orig:1979 t ] [1979])
(nil)))
This appears to be happening while trying to inline f (written in file2.c) into another function in file1.c.
I've also seen in some cases a message such as "impossible constraint in ‘asm’", which amazingly is solved by just bringing back the volatile keyword to the asm statement, but of course this isn't ideal.
My theory is that the "m" constraints may require a register to materialize, and since I'm already working at the limit of available registers, this is the straw that broke the camel's back.
If I take out one (say the input) memory ("m") constraint and bring back the "memory" clobber, this now works.
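As a side note on the casts used in the memory operands above: the point of *(const int(*)[10])in is that the operand names the whole ten-element array rather than only its first element. The sketch below illustrates this with an empty asm template, so it should build with GCC or Clang on any target. The function names are hypothetical, and the sketch does not reproduce the register-pressure failure described here.

```c
#include <stddef.h>

/* Size of the object named by the array-typed dummy operand: ten ints. */
static size_t whole_array_bytes(const int *in)
{
    return sizeof(*(const int (*)[10])in);
}

/* Size of the object named by a plain dereference: only in[0]. */
static size_t first_elem_bytes(const int *in)
{
    return sizeof(*in);
}

/* Empty template with the same operand shapes as in the attempt above.
   "+m" is used for the output array (an assumption made for this sketch)
   because the empty template does not actually overwrite it. */
static void declare_accesses(int *out, const int *in)
{
    __asm__(""
            : "+m"(*(int (*)[10])out)
            : "r"(out), "r"(in), "m"(*(const int (*)[10])in));
}
```

Nothing in the template refers to the memory operands; they exist only to describe which bytes the statement reads and writes.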
The following variant, which I'm not even sure makes sense, also generates the same error:
asm volatile(
"my assembly code block"
: [out] "=r" (out), "=m" (*(int(*)[10])out) /* possibly other output constraints */
: [in] "rm" (*(const int(*)[10])in) /* possibly other input constraints */
);
Something which I also tried, and which made me suspect that the above code doesn't even make sense, is:
asm volatile(
"my assembly code block"
: [out] "=rm" (*(int(*)[10])out) /* possibly other output constraints */
: [in] "rm" (*(const int(*)[10])in) /* possibly other input constraints */
);
Now I get a truckload of errors like the following:
/var/folders/bg/8_8vh7ks6vq1t3mq4l3fswcc0000gn/T//ccTvGYHb.s:719: Error: ARM register expected -- `ldr.w fp,[[sp,#40],#28]'
This looks like the compiler is trying to use the address of out on the stack as the memory constraint. My problem is that I need to materialize it in a register so I can do base + offset addressing.
EDIT: I have found the "Q" constraint in the list of constraints for particular machines for ARM, which is described as such: "A memory reference where the exact address is in a single register (‘‘m’’ is preferable for asm statements)". My asm statement looks like this with this constraint:
asm volatile(
"my assembly code block"
: [out] "=Q" (*(int(*)[10])out) /* possibly other output constraints */
: [in] "Q" (*(const int(*)[10])in) /* possibly other input constraints */
);
This results in errors such as this:
/var/folders/bg/8_8vh7ks6vq1t3mq4l3fswcc0000gn/T//ccgymksW.s:40: Error: garbage following instruction -- `ldrd r3,r8,[[r1],#8]'
I feel like I'm getting close to my answer. The pointer I need has been materialized in a register, and apparently it's all down to essentially a formatting issue: since "Q" is still a kind of memory constraint, the register comes wrapped in brackets, i.e. [r1] rather than just r1. All I need is to strip these brackets, and I should be done.
The question
It appears that what I really need is some kind of constraint that materializes a pointer in a register and, at the same time, informs gcc that I'm using the specific region of memory pointed to by that array, so that I don't need to use asm volatile + the "memory" clobber. In other words, I'm hoping there's some constraint X which I can substitute into the code below that works for my case:
asm volatile(
"my assembly code block"
: [out] "=X" (*(int(*)[10])out) /* possibly other output constraints */
: [in] "X" (*(const int(*)[10])in) /* possibly other input constraints */
);
What is the proper constraint to substitute for "X" here?
I won't rule out the possibility that this is an XY problem after all -- maybe what I need is not a magical constraint, but a completely different style of writing the constraints so that I don't run into this problem. Either way, I am open to any suggestions.
repne scasb (a loop over an array) to the compiler that way. Semi-related: Looping over arrays with inline assembly
"Q" might help it realize that it can use the same register for the "r"(pointer) and "Q"(*(array type)pointer) constraints, if it wasn't seeing that with "m". There might be a modifier to print just the bare register name instead of the [reg] addressing mode, but gcc.gnu.org/onlinedocs/gcc/… doesn't show one and I wouldn't be surprised if there isn't one.