
I want to load something int4-sized (i.e. a 16-byte struct) from memory, but there does not seem to be a constraint for b128.

__device__ int4 LoadVolatile(int4* that) {
    int4 result;
    asm("ld.volatile.global.b128 %0, [%1];" : "=r"(result) : "l"((void*)that));
    return result;
}

This gives me error:

error : an asm operand must have scalar type
asm("ld.volatile.global.b128 %0, [%1];" : "=r"(result) : "l"((void*)that));
^

However, the constraint table in the Inline PTX Assembly documentation only lists:

"h" = .u16 reg
"r" = .u32 reg
"l" = .u64 reg
"f" = .f32 reg
"d" = .f64 reg

There is no constraint for b128.

How do I change the above code to load an int4 into result?

    Surely you want to use ld.volatile.global.v4.s32? That returns 4 values Commented Dec 5, 2024 at 15:02
  • If I do the following, it compiles: `__device__ int4 LoadVolatile(const int4* that) { int4 result; asm("ld.volatile.global.v4.u32 {%0, %1, %2, %3}, [%4];" : "=r"(result.x), "=r"(result.y), "=r"(result.z), "=r"(result.w) : "l"((void*)that)); return result; }` Commented Dec 5, 2024 at 15:13
  • But I still wonder how to do a b128 load, in case I do not have an easy translation to an int4 handy. Commented Dec 5, 2024 at 15:14
  • What type other than int4 would you want to store the b128 in? I don't think CUDA C++ has a 128 bit type. You always have some vector type with which you can use the appropriate vector load/store instruction Commented Dec 6, 2024 at 8:28

1 Answer


Since the initial release of CUDA almost 20 years ago, PTX has supported vectorized loads and stores for the C++ vector types. In this case you want the .v4 variant of the load, which in inline PTX returns the four members of the int4, viz:

__device__ int4 LoadVolatile(const int4* that) 
{ 
    int4 result; 
    asm("ld.volatile.global.v4.u32 {%0, %1, %2, %3}, [%4];" : 
        "=r"(result.x), 
        "=r"(result.y), 
        "=r"(result.z), 
        "=r"(result.w) : "l"((void*)that)); 
    return result; 
}

There is no constraint for b128

128 bit integral types (as opposed to 128 bit vector types) are a recent addition to CUDA, and only recent architectures support them. While a 128 bit integral load is the wrong solution here, it appears that the inline PTX facility in CUDA has not been updated to include a 128 bit integral type constraint. If this really matters to you, I would suggest filing a feature request with NVIDIA.
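For the case raised in the comments, where no convenient 16-byte vector type is at hand, here is a sketch of a workaround (the function name is illustrative, not from the original post): move the same 128 bits through two 64-bit registers with a .v2.u64 vector load, which the existing "l" constraint does cover, and reassemble the pieces. This assumes the usual little-endian layout of int4 members within the two 64-bit words, which holds on current NVIDIA GPUs.

```cuda
// Sketch: 128-bit volatile load without a b128 constraint, via .v2.u64.
// "l" maps each unsigned long long to a .u64 register.
__device__ int4 LoadVolatileViaU64(const int4* that)
{
    unsigned long long lo, hi;
    asm("ld.volatile.global.v2.u64 {%0, %1}, [%2];"
        : "=l"(lo), "=l"(hi)
        : "l"((void*)that));
    int4 result;
    result.x = (int)(lo & 0xffffffffULL);  // low  32 bits of first word
    result.y = (int)(lo >> 32);            // high 32 bits of first word
    result.z = (int)(hi & 0xffffffffULL);  // low  32 bits of second word
    result.w = (int)(hi >> 32);            // high 32 bits of second word
    return result;
}
```

The .v4.u32 form in the answer above is the more direct choice when an int4 is the destination anyway; the two-register form is only useful when the 128 bits need to land somewhere other than a vector type.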
