GCC inline assembly cannot load local variable's address into register in x64 Intel format?

Question

I am quite used to Intel-format inline assembly. Does anyone knows how to convert the two AT&T lines into Intel format in the code below? It is basically loading local variable's address into a register.

int main(int argc, const char *argv[]){
    float x1[256];
    float x2[256];

    for(int x=0; x<256; ++x){
        x1[x] = x;
        x2[x] = 0.5f;
    }

    asm("movq %0, %%rax"::"r"(&x1[0])); // how to convert to Intel format?
    asm("movq %0, %%rbx"::"r"(&x2[0])); // how to convert to Intel format?

    asm(".intel_syntax noprefix\n"
        "mov rcx, 32\n"
"re:\n"
        "vmovups ymm0, [rax]\n"
        "vmovups ymm1, [rbx]\n"
        "vaddps ymm0, ymm0, ymm1\n"
        "vmovups [rax], ymm0\n"
        "add rax, 32\n"
        "add rbx, 32\n"
        "loopnz re"
    );
}

Specifically, loading on-stack local variables using mov eax, [var_a] is allowed when compiled in 32-bit mode. For example,

// a32.cpp
#include <stdint.h>
extern "C" void f(){
    int32_t a=123;
    asm(".intel_syntax noprefix\n"
        "mov eax, [a]"
    );
}

It compiles well:

xuancong@ubuntu:~$ rm -f a32.so && g++-7 -mavx -fPIC -masm=intel -shared -o a32.so -m32 a32.cpp && ls -al a32.so
-rwxr-xr-x 1 501 dialout 6580 Aug 28 09:26 a32.so

However, the same syntax is not allowed when compiled in 64-bit mode:

// a64.cpp
#include <stdint.h>
extern "C" void f(){
    int64_t a=123;
    asm(".intel_syntax noprefix\n"
        "mov rax, [a]"
    );
}

It does not compile:

xuancong@ubuntu:~$ rm -f a64.so && g++-7 -mavx -fPIC -masm=intel -shared -o a64.so -m64 a64.cpp && ls -al a64.so
/usr/bin/ld: /tmp/cclPNMoq.o: relocation R_X86_64_32S against undefined symbol `a' can not be used when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Nonrepresentable section on output
collect2: error: ld returned 1 exit status

So is there some way to make this work without using input:output:clobber, because simple local variables or function arguments can be accessed directly via mov rax, [rsp+##] or mov rax, [rbp+##] without clobbering other registers?

Unfortunately your example code demonstrates a number of the reasons why you shouldn't use inline assembly. Aside from the errors that will result from your broken code, your assembly code is less optimal then what a compiler can generate. For such a simple loop you can rely on auto-vectorization optimzations. For more complex code you can use intrinsics instead. — Ross Ridge
– Ross Ridge, Commented Aug 24, 2020 at 16:15
There are many things wrong here. The lockless tutorial is very good, though quite dense to work through. The documentation for gcc's extended asm has improved greatly with subsequent releases. If you provided pseudo-code of what you're trying to achieve - preferable with well-documented C code - you might get more help. It's quite possible you may achieve better results with vector intrinsics! — Brett Hale
– Brett Hale, Commented Aug 24, 2020 at 18:47
Why would you ever manually vectorize with AVX but then use a slow loopnz instruction?!! Especially when it makes no sense to test for add rbx, 32 having set ZF, just dec ecx / jnz would be the sane option. If this is the kind of asm you're writing by hand, you should really just switch to intrinsics. As well as inefficient, this is super broken because you don't declare clobbers on registers you modify, or tell the compiler about the memory you read and write. Expect this to break, especially if compiled with optimization enabled. — Peter Cordes
– Peter Cordes, Commented Aug 24, 2020 at 21:37
To direct some comments about the original question "mov eax, [a]" doesn't generate the code you think it does. If you were to review the generated code you would discover that it generated a mov from a memory operand that wasn't on the stack. GCC inline assembly doesn't support accessing variables directly. The GCC manual has a warning it isn't supported even for global variable (not on the stack). I assume from your question that you may have been developing on MSVC using Microsoft's inline assembly? — Michael Petch
– Michael Petch, Commented Aug 28, 2020 at 3:37
If you were to use clang++ and the -fms-extensions option you would be able to generate accesses to variable outside the inline assembly. You'd code it like MSVC: asm { mov rax, [a] } . GCC doesn't support MS extensions. — Michael Petch
– Michael Petch, Commented Aug 28, 2020 at 3:44

xuancong84 · Accepted Answer · 2020-08-25 12:02:48Z

0

Great, let us look at the test result:

#include <iostream>
#include <cstdlib>
#include <cstdio>
#include <time.h>
#include <immintrin.h>

#define N 256000000
using namespace std;

void f1a(float *a, float *b, int64_t n){
    asm("movq %0, %%rax"::"r"(a));
    asm("movq %0, %%rbx"::"r"(b));
    asm("movq %0, %%rcx"::"r"(n));

    asm(".intel_syntax noprefix\n"
        "shr rcx, 3\n"
"re:\n"
        "vmovaps ymm0, [rax]\n"
        "vmovaps ymm1, [rbx]\n"
        "vaddps ymm0, ymm0, ymm1\n"
        "vmovaps [rax], ymm0\n"
        "add rax, 32\n"
        "add rbx, 32\n"
        "loopnz re"
    );
}

void f1b(float *a, float *b, int64_t n){
    asm("movq %0, %%rax"::"r"(a));
    asm("movq %0, %%rbx"::"r"(b));
    asm("movq %0, %%rcx"::"r"(n));

    asm(".intel_syntax noprefix\n"
        "shr rcx, 3\n"
"re1:\n"
        "vmovaps ymm0, [rax]\n"
        "vmovaps ymm1, [rbx]\n"
        "vaddps ymm0, ymm0, ymm1\n"
        "vmovaps [rax], ymm0\n"
        "add rax, 32\n"
        "add rbx, 32\n"
        "dec rcx\n"
        "jnz re1"
    );
}

void f1c(float *a, float *b, int64_t n){
    asm("movq %0, %%rax"::"r"(a));
    asm("movq %0, %%rbx"::"r"(b));
    asm("movq %0, %%rcx"::"r"(n));

    asm(".intel_syntax noprefix\n"
"re2:\n"
        "sub rcx, 8\n"
        "vmovaps ymm0, [rax+rcx*4]\n"
        "vmovaps ymm1, [rbx+rcx*4]\n"
        "vaddps ymm0, ymm0, ymm1\n"
        "vmovaps [rax+rcx*4], ymm0\n"
        "jnz re2"
    );
}

void f2a(float *a, float *b, int64_t n){
    for(int i=n-8; i>=0; i-=8) {
        __m256 x8 = _mm256_load_ps(&a[i]);
        __m256 y8 = _mm256_load_ps(&b[i]);
        __m256 s = _mm256_add_ps(x8, y8);
        _mm256_store_ps(&a[i], s);
    }
}

void f2b(float *a, float *b, int64_t n){
    for(int i=(n>>3)-1; i>=0; --i) {
        __m256 x8 = _mm256_load_ps(&a[i*8]);
        __m256 y8 = _mm256_load_ps(&b[i*8]);
        __m256 s = _mm256_add_ps(x8, y8);
        _mm256_store_ps(&a[i*8], s);
    }
}

void f3(float *a, float *b, int64_t n){
    for(int i=n-1; i>=0; --i)
        a[i] += b[i];
}

void test(float *a, float *b, void(*func)(float*, float*, int64_t), char *name){
    clock_t t;
    printf("Testing %s():", name); fflush(stdout);
    t = clock();
    func(a, b, N);
    printf("%lu\n", clock()-t); fflush(stdout);
}

alignas(64) float x1[N];
alignas(64) float x2[N];

int main(int argc, const char *argv[]){
    printf("Preparing buffer ...");
    fflush(stdout);
    for(int x=0; x<N; ++x){
        x1[x] = x/10.0f;
        x2[x] = 0.5f+1.0f/(x+1);
    }
    printf("Done!\n");
    fflush(stdout);

    test(x1, x2, f3, "warm-up-cache");
    test(x1, x2, f1a, "f1a");
    test(x1, x2, f1b, "f1b");
    test(x1, x2, f1c, "f1c");
    test(x1, x2, f2a, "f2a");
    test(x1, x2, f2b, "f2b");
    test(x1, x2, f3, "f3");

    return 0;
}

Output:

Preparing buffer ...Done!
Testing warm-up-cache():551638
Testing f1a():179409
Testing f1b():159309
Testing f1c():172496
Testing f2a():247539
Testing f2b():245975
Testing f3():520559

Since inline assembly does not compile with -O3, I have commented out f1* and compile with -O3. The O3-test result is as follows:

Testing warm-up-cache():233775
Testing f2a():170199
Testing f2b():187909
Testing f3():181979

The improvement is not that significant on this simple example. However, the solution to the OP is still absent. Suggested-duplicate posts do not contain 64-bit Intel format solution.

answered Aug 25, 2020 at 12:02

xuancong84

1,61518 silver badges19 bronze badges

Sign up to request clarification or add additional context in comments.

16 Comments

David Wohlferd Over a year ago

Rather than belabor the (still) bad asm, I've tried to clean that up a bit so we can see some comparisons (and build using -O3). It's hard to get good timings on godbolt, due to variations in server load. In general, f2a seems to perform better than f1a. f2b seems slightly worse than f1b, but that appears related to how n is passed. Replacing with N improves performance. Looks like intrinsics are safer, faster, and (way) easier to maintain. BTW: doesn't your asm write moving up and your c++ write moving down? Not exactly apples-apples.

Brett Hale Over a year ago

Your misuse of extended inline asm contributed to the question being closed in the first place. You cannot make assumptions about what the compiler is doing in between separate asm statements, and there's still no correct entries for input: output: clobber specifications. One example: you clobber %rbx, which should be 'callee'-saved. It's simply good luck that the program doesn't crash. Myself, and others re-opened the question mostly because the AVX encodings were interesting in themselves.

Brett Hale Over a year ago

Also - using clock() for timing is effectively useless here. The tests are too fast, introducing too much uncertainty. Try running each test 100 times, or 1000 times (for example) and / or use the rdtsc time-stamp counter.

David Wohlferd Over a year ago

"the register state will not change between consecutive asm blocks" - Says who? There is absolutely no guarantee of this. It might happen that way, or it might not. Indeed (per the docs): Do not expect a sequence of asm statements to remain perfectly consecutive after compilation, even when you are using the volatile qualifier. If certain instructions need to remain consecutive in the output, put them in a single multi-instruction asm statement. If something gets interleaved with your asm, registers can change.

David Wohlferd Over a year ago

"still work almost all the time" - Who wants code that works "almost" all the time? Follow the rules and your code will work dependably instead of "hoping" that it will run correctly this time. Have you tried using N instead of n in f2b as I suggested? And "walking up" for both the asm and c++ code (instead of up for asm and down for c++)? My expectation is that a correctly written f2b will outperform the asm in f1b. Which would call into question the whole motivation for writing inline asm at all.

|

Collectives™ on Stack Overflow

GCC inline assembly cannot load local variable's address into register in x64 Intel format?

1 Answer 1

16 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

16 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related