175 questions
1 vote · 1 answer · 284 views
How are fp6 and fp4 supported by NVIDIA Tensor Cores on Blackwell?
I am writing PTX assembly in CUDA C++ for research. This is my setup:
I downloaded the latest CUDA toolkit (13.0) yesterday on Linux under WSL.
The local compilation environment does not ...
0 votes · 1 answer · 136 views
Inline ptx syntax for b128 load
I want to load something int4-sized (i.e. a 16-byte struct) from memory, but there does not seem to be an operand constraint for b128.
__device__ int4 LoadVolatile(int4* that) {
int4 result;
asm("...
0 votes · 1 answer · 651 views
Questions about mma instruction with Nvidia ptx
My understanding of the mma instruction in PTX is (please tell me if I'm wrong):
it is a per-warp instruction
it needs specific elements loaded into the registers of each thread within the target warp
...
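For reference, the warp-level nature of tensor-core math is easiest to see in the CUDA C++ wmma API, which wraps the same hardware; the shapes, types and function name below are illustrative (requires sm_70 or newer).

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// All 32 threads of a warp call these functions together; each thread ends up
// holding a fragment (a few registers) of the 16x16x16 tiles.
__device__ void warp_tile_mma(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // stride of 16 elements
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}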
3 votes · 1 answer · 143 views
Interaction between global stores and `bar.sync`
Suppose I have some PTX that looks like this:
st.global.v4.b32 ...
bar.sync
I know that bar.sync will block execution until the st.global.v4.b32 is visible to all threads in the block. My question is:...
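A source-level sketch of the same sequence (illustrative, not taken from the question): a vectorized global store followed by a block-wide barrier.

__global__ void publish(int4* out) {
    if (threadIdx.x == 0)
        out[blockIdx.x] = make_int4(1, 2, 3, 4);  // typically compiles to st.global.v4.b32
    __syncthreads();                              // emits bar.sync 0
    // After the barrier, every thread in this block that reads out[blockIdx.x]
    // observes the value stored by thread 0.
    int4 v = out[blockIdx.x];
    (void)v;
}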
3 votes · 1 answer · 513 views
When is shfl.sync.idx fast?
Using the .idx option of shfl.sync, it is possible to arbitrarily permute registers between threads in a single warp. The hope is that by using shfl.sync, you can avoid storing and then loading data ...
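For orientation, a minimal sketch of the intrinsic that lowers to shfl.sync.idx.b32 (names are illustrative): every lane reads the register value held by src_lane of the same warp.

__device__ float warp_permute(float v, int src_lane) {
    // All 32 lanes participate; src_lane may differ per lane, giving an
    // arbitrary permutation (or broadcast) of the register across the warp.
    return __shfl_sync(0xffffffffu, v, src_lane);
}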
4 votes · 2 answers · 680 views
Does PTX (8.4) not cover smaller-shape WMMA instructions?
I want to use a SASS instruction which (AFAICT) is not available via a PTX instruction as of CUDA 12.4. Namely, suppose it is: HMMA.16816.F16 - a warp-wide matrix-multiply-and-add, of half-precision ...
-1 votes · 1 answer · 797 views
CUDA __shfl_down_sync does not work with __match_any_sync
Kernel foo's goal is to compute the sum of the values that have the same id as id[0]. I checked that the mask acquired from __match_any_sync correctly identifies all other threads in the warp with the same ...
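An illustrative sketch of the general pattern (not the asker's code, and assuming a full warp is active): sum values across all lanes whose id matches, using the group mask from __match_any_sync. The naive full-warp shuffle loop sidesteps the usual pitfall that the shfl_down reduction pattern assumes a dense set of consecutive lanes, which a match group generally is not.

__device__ float sum_match_group(int id, float value) {
    unsigned group = __match_any_sync(0xffffffffu, id);      // lanes with equal id
    float sum = 0.0f;
    for (int lane = 0; lane < 32; ++lane) {
        float other = __shfl_sync(0xffffffffu, value, lane);  // all lanes participate
        if (group & (1u << lane)) sum += other;               // keep only matching lanes
    }
    return sum;
}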
0 votes · 1 answer · 1k views
Confusion about __cvta_generic_to_shared
NVIDIA Ampere GPUs support a feature: asynchronous copy from global memory to shared memory, bypassing L1 and the register file.
The corresponding PTX instruction is cp.async.
Why is __cvta_generic_to_shared needed to convert a shared ...
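A hedged sketch of the usual pattern (assumes compute capability 8.0+ and a 16-byte-aligned destination): cp.async takes a raw shared-state-space address, which is why the generic pointer to a __shared__ object is converted first.

__device__ void copy_16B_async(void* smem_dst, const void* gmem_src) {
    unsigned saddr = (unsigned)__cvta_generic_to_shared(smem_dst);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;"
                 :: "r"(saddr), "l"(gmem_src));
    asm volatile("cp.async.commit_group;");
    asm volatile("cp.async.wait_group 0;");   // wait for the copy to land in shared memory
}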
2 votes · 0 answers · 117 views
How to compare AT&T-assembly-like sources (e.g. CUDA PTX)?
I want to compare two pieces of low-level code, each in its own file. The format is AT&T-assembly-style: for me, it's two CUDA PTX files, but this question also applies to the output of gcc -S or ...
1 vote · 1 answer · 981 views
Can I hint to CUDA that it should move a given variable into the L1 cache?
Can I hint to CUDA that it should asynchronously move a given variable into the L1 cache?
I have a deterministic data access pattern (crazy_access_order) that is unfortunately very ill-served by an ...
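One possible lever, sketched here as an assumption rather than a guaranteed mechanism: PTX exposes a prefetch instruction that asks for the line containing an address to be brought into L1. It is only a hint, and the hardware may ignore it.

__device__ void prefetch_into_l1(const void* global_ptr) {
    asm volatile("prefetch.global.L1 [%0];" :: "l"(global_ptr));
}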
0 votes · 1 answer · 398 views
Is it bad that NVCC generates PTX code that is very generous with registers?
I recently read through the generated PTX code of a CUDA kernel. I realized that many registers are used just to store an intermediate value and are then never used again, and that NVCC generally ...
2 votes · 1 answer · 458 views
Are load and store operations in shared memory atomic?
I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA.
On the one hand, it seems that any load/store is compiled to ...
-2 votes · 1 answer · 2k views
Why is PyTorch 1.7 with CUDA 10.1 not compatible with the NVIDIA A100 (Ampere) architecture, given the PTX compatibility principle?
According to NVIDIA's official documentation, if a CUDA application is built to include PTX, then because PTX is forward-compatible it can run on any GPU with compute capability ...
1 vote · 1 answer · 169 views
Why does NVCC not optimize away ceilf() for literals?
(Followup question for Compile-time ceiling function, for literals, in C?)
Considering the following CUDA function:
__device__ int foo_f() { return ceilf(1007.1111); }
It should be easy to optimize ...
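One hedged workaround (illustrative, not from the question): evaluate the ceiling at compile time with a constexpr helper so that no ceilf() call is left for the backend to optimize away.

__host__ __device__ constexpr int ceil_to_int(double x) {
    int i = static_cast<int>(x);
    return (static_cast<double>(i) < x) ? i + 1 : i;
}

__device__ int foo_f() {
    constexpr int v = ceil_to_int(1007.1111);  // forced to a compile-time constant
    return v;
}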
2 votes · 1 answer · 2k views
Understanding the parameters of PTX instruction mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
How to understand the parameters in the following snippet of CUDA inline assembly code?
......
asm volatile( \
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 \n" \
...
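A hedged sketch of how the operands are usually wired up for this shape (variable names are illustrative): per thread, D takes 4 f32 registers, A takes 4 b32 registers holding 8 halves, B takes 2 b32 registers holding 4 halves, and C takes 4 f32 registers.

__device__ void mma_m16n8k16(float d[4], const unsigned a[4],
                             const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}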
1 vote · 1 answer · 925 views
Disable CUDA PTX-to-binary JIT compilation
Is there a way to disable the Just-In-Time compilation of PTX code to GPU assembly when running an application?
There are certain scenarios where one wants to run a GPU-enabled application on CPU-only ...
3 votes · 1 answer · 706 views
CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error
I'm using inline PTX ld.shared to load data from shared memory:
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory
float Csub = 0;
As[TY][TX] = A[a + wA * TY + TX]; ...
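An illustrative sketch of a frequent fix for this error (assuming the usual cause applies here): ld.shared expects a shared-state-space address, so the generic pointer into the __shared__ buffer is converted first with __cvta_generic_to_shared.

__device__ float load_from_shared(const float* smem_ptr) {
    float v;
    unsigned addr = (unsigned)__cvta_generic_to_shared(smem_ptr);
    asm volatile("ld.shared.f32 %0, [%1];" : "=f"(v) : "r"(addr));
    return v;
}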
2 votes · 1 answer · 317 views
What does it mean when a variable "has been demoted" in the PTX?
In the function body of my CUDA kernel, I have a few __shared__ array variables, of a fixed size. When I look at the compiled PTX code (SM 7.5) for one of these arrays, I see a comment saying:
// ...
1 vote · 1 answer · 1k views
Can I combine a "static" CUDA kernel launch with PTX code and get a working binary?
Suppose I take a CUDA program - for example the CUDA vectorAdd sample, and cut out the kernel's implementation, but still have the launch command:
vectorAdd<<<blocksPerGrid, threadsPerBlock>>> ...
0 votes · 1 answer · 185 views
What do the %envregN special registers hold?
I've read: CUDA PTX code %envreg<32> special registers. The poster there was satisfied with not trying to treat OpenCL-originating PTX as regular CUDA PTX. But their question about %envN ...
0 votes · 1 answer · 493 views
What does %f, %rd mean in ptx assembly
Hi, I'm new to CUDA programming. I got this piece of assembly code from building a program with OpenCL.
I came to wonder what those numbers and characters mean, such as %f7, %f11, %rd3, %r3, %f, %p. ...
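For orientation (these are nvcc's naming conventions, not something the PTX ISA mandates, and the counts below are made up): nvcc-generated PTX typically declares its virtual register banks like this, and the prefix tells you the register class.

.reg .pred %p<5>;    // predicate registers (branch conditions)
.reg .b16  %rs<4>;   // 16-bit registers
.reg .b32  %r<10>;   // 32-bit integer registers
.reg .b64  %rd<8>;   // 64-bit integer / address registers
.reg .f32  %f<12>;   // 32-bit floating-point registers
.reg .f64  %fd<4>;   // 64-bit floating-point registers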
2 votes · 1 answer · 119 views
How to compile CUDA code that calls one function twice inside one method?
I'm trying to compile this piece of code:
struct foo {
unsigned long long x0;
};
//__device__ __noinline__ foo bar(foo a, foo b){ // << try this
__device__ foo bar(foo a, foo b){
foo r;
...
1 vote · 1 answer · 610 views
Using nvdisasm to generate control flow image of PTX code
I have a single file of CUDA code compiled to intermediate language PTX code, example.ptx. I would be interested to start poking around with this short file, trying to understand how it works.
I don't ...
1 vote · 1 answer · 118 views
Understanding Performance Behavior of Random Writes to Global Memory
I'm running experiments aiming to understand the behavior of random read and write access to global memory.
The following kernel reads from an input vector (groupColumn) with a coalesced access ...
3 votes · 1 answer · 832 views
In asm volatile inline PTX instructions, why also specify "memory" side effects?
Consider the following excerpt from CUDA's Inline PTX Assembly guide (v10.2):
The compiler assumes that an asm() statement has no side effects
except to change the output operands. To ensure that ...
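A minimal sketch of the pattern that guide section describes (illustrative names): an asm() statement that touches memory lists "memory" in its clobber list so the compiler will neither cache values in registers across it nor reorder memory accesses around it.

__device__ void store_u32(unsigned* p, unsigned v) {
    asm volatile("st.global.u32 [%0], %1;" :: "l"(p), "r"(v) : "memory");
}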