
I want to load something int4-sized (i.e. a 16-byte struct) from memory, but there does not seem to be a constraint for b128.

__device__ int4 LoadVolatile(int4* that) {
    int4 result;
    asm("ld.volatile.global.b128 %0, [%1];" : "=r"(result) : "l"((void*)that));
    return result;
}

This gives me error:

error : an asm operand must have scalar type
asm("ld.volatile.global.b128 %0, [%1];" : "=r"(result) : "l"((void*)that));
^

However, the constraint table in the Inline PTX Assembly documentation only lists:

"h" = .u16 reg
"r" = .u32 reg
"l" = .u64 reg
"f" = .f32 reg
"d" = .f64 reg

There is no constraint for b128.

How do I change the above code to load an int4 into result?

    Surely you want to use ld.volatile.global.v4.s32? That returns 4 values Commented Dec 5, 2024 at 15:02
  • If I do the following, it compiles: `__device__ int4 LoadVolatile(const int4* that) { int4 result; asm("ld.volatile.global.v4.u32 {%0, %1, %2, %3}, [%4];" : "=r"(result.x), "=r"(result.y), "=r"(result.z), "=r"(result.w) : "l"((void*)that)); return result; }` Commented Dec 5, 2024 at 15:13
  • But I still wonder how to do a b128 load, in case I do not have an easy translation to an int4 handy. Commented Dec 5, 2024 at 15:14
  • What type other than int4 would you want to store the b128 in? I don't think CUDA C++ has a 128 bit type. You always have some vector type with which you can use the appropriate vector load/store instruction Commented Dec 6, 2024 at 8:28

1 Answer


Since the initial release of CUDA almost 20 years ago, PTX has supported vectorized loads and stores for the C++ vector types. In this case you want the .v4 variant of the load, which in inline PTX returns the four members of the int4, viz:

__device__ int4 LoadVolatile(const int4* that) 
{ 
    int4 result; 
    asm("ld.volatile.global.v4.u32 {%0, %1, %2, %3}, [%4];" : 
        "=r"(result.x), 
        "=r"(result.y), 
        "=r"(result.z), 
        "=r"(result.w) : "l"((void*)that)); 
    return result; 
}

There is no constraint for b128

128 bit integral types (as opposed to 128 bit vector types) are a recent addition to CUDA, and only recent architectures support them. While a 128 bit integral load is the wrong solution here, it appears that the inline PTX facility in CUDA has not been updated to include a 128 bit integral type constraint. If this really matters to you, I would suggest filing a feature request with NVIDIA.
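For the case raised in the comments, where no convenient 16-byte vector type is at hand, here is a sketch of a workaround (the function name is illustrative, not from the original post): move the same 128 bits through two 64-bit registers with a .v2.u64 vector load, which the existing "l" constraint does cover, and reassemble the pieces. This assumes the usual little-endian layout of int4 members within the two 64-bit words, which holds on current NVIDIA GPUs.

```cuda
// Sketch: 128-bit volatile load without a b128 constraint, via .v2.u64.
// "l" maps each unsigned long long to a .u64 register.
__device__ int4 LoadVolatileViaU64(const int4* that)
{
    unsigned long long lo, hi;
    asm("ld.volatile.global.v2.u64 {%0, %1}, [%2];"
        : "=l"(lo), "=l"(hi)
        : "l"((void*)that));
    int4 result;
    result.x = (int)(lo & 0xffffffffULL);  // low  32 bits of first word
    result.y = (int)(lo >> 32);            // high 32 bits of first word
    result.z = (int)(hi & 0xffffffffULL);  // low  32 bits of second word
    result.w = (int)(hi >> 32);            // high 32 bits of second word
    return result;
}
```

The .v4.u32 form in the answer above is the more direct choice when an int4 is the destination anyway; the two-register form is only useful when the 128 bits need to land somewhere other than a vector type.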
