I want to load something int4 sized (aka a 16 byte struct) from memory, but there does not seem to be a constraint for b128.
__device__ int4 LoadVolatile(int4* that) {
    int4 result;
    asm("ld.volatile.global.b128 %0, [%1];" : "=r"(result) : "l"((void*)that));
    return result;
}
This gives me the error:
error : an asm operand must have scalar type
asm("ld.volatile.global.b128 %0, [%1];" : "=r"(result) : "l"((void*)that));
^
However, there is no 128-bit constraint; the table in the Inline PTX Assembly documentation only shows:
"h" = .u16 reg
"r" = .u32 reg
"l" = .u64 reg
"f" = .f32 reg
"d" = .f64 reg
There is no constraint for b128.
How do I change the above code to load an int4 into result?
ld.volatile.global.v4.s32?
That returns 4 values.
What other than an int4 would you want to store the b128 in? I don't think CUDA C++ has a 128-bit type. You always have some vector type with which you can use the appropriate vector load/store instruction.
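The vector-load suggestion above could be sketched roughly as follows. Instead of binding the whole int4 to a single operand, each 32-bit component gets its own "r" constraint, and the PTX vector syntax {%0, %1, %2, %3} receives the four values. This is a minimal sketch, not a definitive implementation: it assumes `that` is 16-byte aligned and points into global memory (for example, a cudaMalloc allocation), and CopyKernel and LoadVolatileOnDevice are hypothetical helpers added only to exercise the function.

```cuda
#include <cuda_runtime.h>

// Volatile 16-byte load expressed as one 4 x 32-bit vector load.
// Assumption: `that` is 16-byte aligned and refers to global memory.
__device__ int4 LoadVolatile(const int4* that) {
    int4 result;
    asm volatile("ld.volatile.global.v4.s32 {%0, %1, %2, %3}, [%4];"
                 : "=r"(result.x), "=r"(result.y), "=r"(result.z), "=r"(result.w)
                 : "l"(that)   // 64-bit address operand
                 : "memory");
    return result;
}

// Hypothetical test kernel: copies one int4 through LoadVolatile.
__global__ void CopyKernel(const int4* src, int4* dst) {
    *dst = LoadVolatile(src);
}

// Hypothetical host helper: round-trips `value` through the device.
// Error checking is omitted to keep the sketch short.
int4 LoadVolatileOnDevice(int4 value) {
    int4* d_src = nullptr;
    int4* d_dst = nullptr;
    int4 out = make_int4(0, 0, 0, 0);
    cudaMalloc(&d_src, sizeof(int4));
    cudaMalloc(&d_dst, sizeof(int4));
    cudaMemcpy(d_src, &value, sizeof(int4), cudaMemcpyHostToDevice);
    CopyKernel<<<1, 1>>>(d_src, d_dst);
    cudaMemcpy(&out, d_dst, sizeof(int4), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
    return out;
}
```

The four scalar outputs satisfy the "an asm operand must have scalar type" rule, while the hardware still issues a single 128-bit transaction. If the pointer might not be a global-memory address, dropping the .global qualifier (a generic ld.volatile.v4.s32) is one possible alternative.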