1 vote
1 answer
284 views

I am writing PTX assembly code in CUDA C++ for research. This is my setup: I downloaded the latest CUDA C++ toolkit (13.0) yesterday on WSL Linux. The local compilation environment does not ...
Junhao Liu
0 votes
1 answer
136 views

I want to load something int4-sized (i.e., a 16-byte struct) from memory, but there does not seem to be an asm constraint for b128. __device__ int4 LoadVolatile(int4* that) { int4 result; asm("...
Johan
  • 77.4k
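With no "b128" inline-asm constraint available, one common workaround (a sketch of my own, not the asker's code) is to express the 16-byte access as a vectorized load split across four 32-bit registers:

```cuda
// Sketch: emulate a 16-byte volatile load with ld.volatile.global.v4.u32,
// binding each 32-bit component to a register via the "r" constraint and
// the 64-bit pointer via "l".
__device__ int4 LoadVolatile(const int4* that) {
    int4 result;
    asm volatile("ld.volatile.global.v4.u32 {%0, %1, %2, %3}, [%4];"
                 : "=r"(result.x), "=r"(result.y), "=r"(result.z), "=r"(result.w)
                 : "l"(that)
                 : "memory");
    return result;
}
```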
0 votes
1 answer
651 views

Hi, my understanding of the mma instruction in PTX is (please tell me if I'm wrong): it is a per-warp instruction; it needs to load specific elements into a register of each thread within the target warp ...
Shore
  • 1,059
3 votes
1 answer
143 views

Suppose I have some PTX that looks like this: st.global.v4.b32 ... bar.sync I know that bar.sync will block execution until the st.global.v4.b32 is visible to all threads in the block. My question is:...
Elliot Gorokhovsky
3 votes
1 answer
513 views

Using the .idx option of shfl.sync, it is possible to arbitrarily permute registers between threads in a single warp. The hope is that by using shfl.sync, you can avoid storing and then loading data ...
Elliot Gorokhovsky
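As a concrete illustration of that pattern (my own minimal sketch; perm is a hypothetical permutation table assumed to hold lane indices 0-31):

```cuda
// Each lane publishes its own value and reads the value held by lane
// perm[lane]; the exchanged data never touches shared or global memory.
__global__ void permuteWithinWarp(const float* in, float* out, const int* perm) {
    int lane = threadIdx.x & 31;
    float v = in[threadIdx.x];
    float permuted = __shfl_sync(0xffffffff, v, perm[lane]);
    out[threadIdx.x] = permuted;
}
```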
4 votes
2 answers
680 views

I want to use a SASS instruction which (AFAICT) is not available via a PTX instruction as of CUDA 12.4. Namely, suppose it is: HMMA.16816.F16 - a warp-wide matrix-multiply-and-add, of half-precision ...
einpoklum
  • 137k
-1 votes
1 answer
797 views

Kernel foo's goal is to compute the sum of the values that have the same id as id[0]. I checked that the mask acquired from __match_any_sync correctly identifies all other threads in the warp with the same ...
SnowSR
  • 3
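The setup being described can be sketched roughly like this (my reconstruction, not the asker's actual kernel; a single warp is assumed):

```cuda
// Sum the values of all lanes whose id matches id[0]. __match_any_sync
// returns, for each lane, a mask of the lanes holding the same id.
__global__ void foo(const int* id, const float* val, float* out) {
    int lane = threadIdx.x & 31;
    unsigned mask = __match_any_sync(0xffffffff, id[lane]);
    if (id[lane] == id[0]) {
        // Every matched lane contributes; alternatively a leader lane
        // (__ffs(mask) - 1) could accumulate the group's sum via shuffles.
        atomicAdd(out, val[lane]);
    }
}
```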
0 votes
1 answer
1k views

Nvidia Ampere GPUs support a feature: asynchronous copy from global memory to shared memory, bypassing the L1 cache and the register file. The corresponding PTX instruction is cp.async. Why is __cvta_generic_to_shared needed to convert a shared ...
foreverrookie
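A minimal sketch of the pattern in question (sm_80+; my own illustration): cp.async takes a .shared state-space address as its destination, which is why the generic shared-memory pointer must be converted first:

```cuda
// cp.async copies 16 bytes per thread from global to shared memory,
// bypassing the register file. Its destination operand must be a
// .shared state-space address, hence __cvta_generic_to_shared.
__global__ void stageTile(const float4* gsrc) {
    __shared__ float4 tile[32];
    unsigned saddr = (unsigned)__cvta_generic_to_shared(&tile[threadIdx.x]);
    asm volatile("cp.async.ca.shared.global [%0], [%1], 16;"
                 :: "r"(saddr), "l"(gsrc + threadIdx.x));
    asm volatile("cp.async.commit_group;");
    asm volatile("cp.async.wait_group 0;");
    __syncthreads();
    // ... use tile[] ...
}
```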
2 votes
0 answers
117 views

I want to compare two pieces of low-level code, each in its own file. The format is AT&T-Assembly-style: For me, it's two CUDA PTX files, but this question applies also for the output of gcc -S or ...
einpoklum
  • 137k
1 vote
1 answer
981 views

Can I hint to CUDA that it should asynchronously move a given variable into the L1 cache? I have a deterministic data access pattern (crazy_access_order) that is unfortunately very ill-served by an ...
emchristiansen
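PTX does expose a non-binding prefetch hint that can be wrapped in inline assembly; a sketch, with no guarantee the hardware honors it:

```cuda
// prefetch.global.L1 asks the memory system to pull the line containing
// p into L1. It is only a hint: it may be dropped, and it does not block.
__device__ void prefetchL1(const void* p) {
    asm volatile("prefetch.global.L1 [%0];" :: "l"(p));
}
```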
0 votes
1 answer
398 views

I recently read through the generated PTX code of a CUDA kernel. I realized that many registers are used to just store an intermediate value and are then never used again, and that NVCC generally ...
Niels Slotboom
2 votes
1 answer
458 views

I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA. On the one hand, it seems that any load/store is compiled to ...
Pierre T.
  • 388
-2 votes
1 answer
2k views

According to Nvidia's official documentation, if a CUDA application is built to include PTX, the PTX is forward-compatible, meaning it is supported to run on any GPU with compute capability ...
Seven link bob
1 vote
1 answer
169 views

(Followup question for Compile-time ceiling function, for literals, in C?) Considering the following CUDA function: __device__ int foo_f() { return ceilf(1007.1111); } It should be easy to optimize ...
einpoklum
  • 137k
2 votes
1 answer
2k views

How should one understand the parameters in the following snippet of CUDA inline assembly code? ...... asm volatile("mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 \n" ...
inprocess
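For that shape the operand split works out as follows (a sketch assuming the usual f16-in/f32-out variant; the packed .f16x2 A/B fragments travel in 32-bit registers):

```cuda
// m16n8k16, f16 x f16 -> f32: per thread, D and C are 4 floats,
// A is 4 packed half2 values (u32 regs), B is 2 packed half2 values.
__device__ void mma16816(float d[4], const unsigned a[4],
                         const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```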
1 vote
1 answer
925 views

Is there a way to disable the just-in-time compilation of PTX code to GPU assembly when running an application? There are certain scenarios where one wants to run a GPU-enabled application on CPU-only ...
648trindade
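One documented knob (assuming a reasonably recent CUDA toolkit) is the CUDA_DISABLE_PTX_JIT environment variable, which makes launches fail instead of silently JIT-compiling embedded PTX; ./my_app below is a placeholder binary:

```shell
# With JIT disabled, an application that ships no precompiled SASS for the
# installed GPU fails at launch rather than falling back to PTX JIT.
CUDA_DISABLE_PTX_JIT=1 ./my_app
```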
3 votes
1 answer
706 views

I'm using inline PTX ld.shared to load data from shared memory: __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory float Csub = 0; As[TY][TX] = A[a + wA * TY + TX]; ...
Yichen
  • 101
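A self-contained sketch of the pattern (my own simplified version of the asker's code): ld.shared needs a .shared state-space address, so the generic pointer is converted first:

```cuda
#define BLOCK_SIZE 16

__global__ void loadFromShared(const float* A, float* C, int wA, int a) {
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    int TX = threadIdx.x, TY = threadIdx.y;
    As[TY][TX] = A[a + wA * TY + TX];
    __syncthreads();
    float Csub;
    // convert the generic pointer into a 32-bit .shared address for ld.shared
    unsigned addr = (unsigned)__cvta_generic_to_shared(&As[TY][TX]);
    asm volatile("ld.shared.f32 %0, [%1];" : "=f"(Csub) : "r"(addr));
    C[TY * BLOCK_SIZE + TX] = Csub;
}
```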
2 votes
1 answer
317 views

In the function body of my CUDA kernel, I have a few __shared__ array variables, of a fixed size. When I look at the compiled PTX code (SM 7.5) for one of these arrays, I see a comment saying: // ...
einpoklum
  • 137k
1 vote
1 answer
1k views

Suppose I take a CUDA program - for example the CUDA vectorAdd sample, and cut out the kernel's implementation, but still have the launch command: vectorAdd<<<blocksPerGrid, threadsPerBlock>>>...
einpoklum
  • 137k
0 votes
1 answer
185 views

I've read: CUDA PTX code %envreg<32> special registers . The poster there was satisfied with not trying to treat OpenCL-originating PTX as a regular CUDA PTX. But - their question about %envN ...
einpoklum
  • 137k
0 votes
1 answer
493 views

Hi, I'm new to CUDA programming. I've got this piece of assembly code from building a program with OpenCL. I came to wonder what those numbers and characters mean, such as %f7, %f11, %rd3, %r3, %f, %p....
Aesop
  • 161
2 votes
1 answer
119 views

I'm trying to compile this piece of code: struct foo { unsigned long long x0; }; //__device__ __noinline__ foo bar(foo a, foo b){ // << try this __device__ foo bar(foo a, foo b){ foo r; ...
chabapok
  • 962
1 vote
1 answer
610 views

I have a single file of CUDA code compiled to intermediate language PTX code, example.ptx. I would be interested to start poking around with this short file, trying to understand how it works. I don't ...
Toothery
  • 195
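Assuming the CUDA toolkit's binary utilities are on the PATH, a typical way to poke at a standalone .ptx file is to assemble it and then disassemble the result:

```shell
# Assemble the PTX into a cubin for a chosen architecture (sm_75 here is
# just an example), then dump the resulting SASS for inspection.
ptxas -arch=sm_75 -o example.cubin example.ptx
cuobjdump -sass example.cubin
```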
1 vote
1 answer
118 views

I'm running experiments aiming to understand the behavior of random read and write access to global memory. The following kernel reads from an input vector (groupColumn) with a coalesced access ...
Christoph Schulze
3 votes
1 answer
832 views

Consider the following excerpt from CUDA's Inline PTX Assembly guide (v10.2): The compiler assumes that an asm() statement has no side effects except to change the output operands. To ensure that ...
einpoklum
  • 137k