175 questions
1 vote · 1 answer · 284 views
How are fp6 and fp4 supported by NVIDIA Tensor Cores on Blackwell?
I am writing PTX assembly in CUDA C++ for research. This is my setup:
I downloaded the latest CUDA toolkit (13.0) yesterday on Linux under WSL.
The local compilation environment does not ...
0 votes · 1 answer · 136 views
Inline ptx syntax for b128 load
I want to load something int4-sized (i.e. a 16-byte struct) from memory, but there does not seem to be an operand constraint for b128.
__device__ int4 LoadVolatile(int4* that) {
int4 result;
asm("...
0 votes · 1 answer · 651 views
Questions about mma instruction with Nvidia ptx
My understanding of the mma instruction in PTX is (please tell me if I'm wrong):
it is a per-warp instruction
it needs specific elements loaded into the registers of each thread within the target warp
...
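For reference, the warp-level nature of tensor-core math is easiest to see in the CUDA C++ wmma API, which wraps the same hardware; the shapes, types and function name below are illustrative (requires sm_70 or newer).

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// All 32 threads of a warp call these functions together; each thread ends up
// holding a fragment (a few registers) of the 16x16x16 tiles.
__device__ void warp_tile_mma(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);   // stride of 16 elements
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}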
3 votes · 1 answer · 143 views
Interaction between global stores and `bar.sync`
Suppose I have some PTX that looks like this:
st.global.v4.b32 ...
bar.sync
I know that bar.sync will block execution until the st.global.v4.b32 is visible to all threads in the block. My question is:...
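A source-level sketch of the same sequence (illustrative, not taken from the question): a vectorized global store followed by a block-wide barrier.

__global__ void publish(int4* out) {
    if (threadIdx.x == 0)
        out[blockIdx.x] = make_int4(1, 2, 3, 4);  // typically compiles to st.global.v4.b32
    __syncthreads();                              // emits bar.sync 0
    // After the barrier, every thread in this block that reads out[blockIdx.x]
    // observes the value stored by thread 0.
    int4 v = out[blockIdx.x];
    (void)v;
}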
3 votes · 1 answer · 513 views
When is shfl.sync.idx fast?
Using the .idx option of shfl.sync, it is possible to arbitrarily permute registers between threads in a single warp. The hope is that by using shfl.sync, you can avoid storing and then loading data ...
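For orientation, a minimal sketch of the intrinsic that lowers to shfl.sync.idx.b32 (names are illustrative): every lane reads the register value held by src_lane of the same warp.

__device__ float warp_permute(float v, int src_lane) {
    // All 32 lanes participate; src_lane may differ per lane, giving an
    // arbitrary permutation (or broadcast) of the register across the warp.
    return __shfl_sync(0xffffffffu, v, src_lane);
}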
4 votes · 2 answers · 680 views
Does PTX (8.4) not cover smaller-shape WMMA instructions?
I want to use a SASS instruction which (AFAICT) is not available via a PTX instruction as of CUDA 12.4. Namely, suppose it is: HMMA.16816.F16 - a warp-wide matrix-multiply-and-add, of half-precision ...
-1 votes · 1 answer · 797 views
CUDA __shfl_down_sync does not work with __match_any_sync
Kernel foo's goal is to compute the sum of the values that have the same id as id[0]. I checked that the mask acquired from __match_any_sync correctly identifies all other threads in the warp with the same ...
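An illustrative sketch of the general pattern (not the asker's code, and assuming a full warp is active): sum values across all lanes whose id matches, using the group mask from __match_any_sync. The naive full-warp shuffle loop sidesteps the usual pitfall that the shfl_down reduction pattern assumes a dense set of consecutive lanes, which a match group generally is not.

__device__ float sum_match_group(int id, float value) {
    unsigned group = __match_any_sync(0xffffffffu, id);      // lanes with equal id
    float sum = 0.0f;
    for (int lane = 0; lane < 32; ++lane) {
        float other = __shfl_sync(0xffffffffu, value, lane);  // all lanes participate
        if (group & (1u << lane)) sum += other;               // keep only matching lanes
    }
    return sum;
}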
0 votes · 1 answer · 1k views
Confusion about __cvta_generic_to_shared
NVIDIA Ampere GPUs support a feature: asynchronous copy from global memory to shared memory, bypassing L1 and the register file.
The corresponding PTX instruction is cp.async.
Why is __cvta_generic_to_shared needed to convert a shared ...
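A hedged sketch of the usual pattern (assumes compute capability 8.0+ and a 16-byte-aligned destination): cp.async takes a raw shared-state-space address, which is why the generic pointer to a __shared__ object is converted first.

__device__ void copy_16B_async(void* smem_dst, const void* gmem_src) {
    unsigned saddr = (unsigned)__cvta_generic_to_shared(smem_dst);
    asm volatile("cp.async.cg.shared.global [%0], [%1], 16;"
                 :: "r"(saddr), "l"(gmem_src));
    asm volatile("cp.async.commit_group;");
    asm volatile("cp.async.wait_group 0;");   // wait for the copy to land in shared memory
}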
2 votes · 0 answers · 117 views
How to compare AT&T-assembly-like sources (e.g. CUDA PTX)?
I want to compare two pieces of low-level code, each in its own file. The format is AT&T-assembly-style: for me, it's two CUDA PTX files, but this question also applies to the output of gcc -S or ...
1 vote · 1 answer · 981 views
Can I hint to CUDA that it should move a given variable into the L1 cache?
Can I hint to CUDA that it should asynchronously move a given variable into the L1 cache?
I have a deterministic data access pattern (crazy_access_order) that is unfortunately very ill-served by an ...
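One possible lever, sketched here as an assumption rather than a guaranteed mechanism: PTX exposes a prefetch instruction that asks for the line containing an address to be brought into L1. It is only a hint, and the hardware may ignore it.

__device__ void prefetch_into_l1(const void* global_ptr) {
    asm volatile("prefetch.global.L1 [%0];" :: "l"(global_ptr));
}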
0 votes · 1 answer · 398 views
Is it bad that NVCC generates PTX code that is very generous with registers?
I recently read through the generated PTX code of a CUDA kernel. I realized that many registers are used just to store an intermediate value and are then never used again, and that NVCC generally ...
2 votes · 1 answer · 458 views
Are load and store operations in shared memory atomic?
I'm trying to figure out whether load and store operations on primitive types are atomic when we load/store from shared memory in CUDA.
On the one hand, it seems that any load/store is compiled to ...
-2 votes · 1 answer · 2k views
Why is PyTorch 1.7 with CUDA 10.1 not compatible with the NVIDIA A100 (Ampere) architecture, given the PTX compatibility principle?
According to NVIDIA's official documentation, if a CUDA application is built to include PTX, then because PTX is forward-compatible it can run on any GPU with compute capability ...
1 vote · 1 answer · 169 views
Why does NVCC not optimize away ceilf() for literals?
(Followup question for Compile-time ceiling function, for literals, in C?)
Considering the following CUDA function:
__device__ int foo_f() { return ceilf(1007.1111); }
It should be easy to optimize ...
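One hedged workaround (illustrative, not from the question): evaluate the ceiling at compile time with a constexpr helper so that no ceilf() call is left for the backend to optimize away.

__host__ __device__ constexpr int ceil_to_int(double x) {
    int i = static_cast<int>(x);
    return (static_cast<double>(i) < x) ? i + 1 : i;
}

__device__ int foo_f() {
    constexpr int v = ceil_to_int(1007.1111);  // forced to a compile-time constant
    return v;
}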
2 votes · 1 answer · 2k views
Understanding the parameters of PTX instruction mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32
How to understand the parameters in the following snippet of CUDA inline assembly code?
......
asm volatile( \
"mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 \n" \
...
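A hedged sketch of how the operands are usually wired up for this shape (variable names are illustrative): per thread, D takes 4 f32 registers, A takes 4 b32 registers holding 8 halves, B takes 2 b32 registers holding 4 halves, and C takes 4 f32 registers.

__device__ void mma_m16n8k16(float d[4], const unsigned a[4],
                             const unsigned b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}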
1 vote · 1 answer · 925 views
Disable CUDA PTX-to-binary JIT compilation
Is there a way to disable the Just-In-Time compilation of PTX code to GPU assembly when running an application?
There are certain scenarios where one wants to run a GPU-enabled application on CPU-only ...
3 votes · 1 answer · 706 views
CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error
I'm using inline PTX ld.shared to load data from shared memory:
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory
float Csub = 0;
As[TY][TX] = A[a + wA * TY + TX]; ...
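An illustrative sketch of a frequent fix for this error (assuming the usual cause applies here): ld.shared expects a shared-state-space address, so the generic pointer into the __shared__ buffer is converted first with __cvta_generic_to_shared.

__device__ float load_from_shared(const float* smem_ptr) {
    float v;
    unsigned addr = (unsigned)__cvta_generic_to_shared(smem_ptr);
    asm volatile("ld.shared.f32 %0, [%1];" : "=f"(v) : "r"(addr));
    return v;
}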
2 votes · 1 answer · 317 views
What does it mean when a variable "has been demoted" in the PTX?
In the function body of my CUDA kernel, I have a few __shared__ array variables, of a fixed size. When I look at the compiled PTX code (SM 7.5) for one of these arrays, I see a comment saying:
// ...
1 vote · 1 answer · 1k views
Can I combine a "static" CUDA kernel launch with PTX code and get a working binary?
Suppose I take a CUDA program - for example the CUDA vectorAdd sample, and cut out the kernel's implementation, but still have the launch command:
vectorAdd<<<blocksPerGrid, threadsPerBlock>>> ...
0 votes · 1 answer · 185 views
What do the %envregN special registers hold?
I've read: CUDA PTX code %envreg<32> special registers. The poster there was satisfied with not trying to treat OpenCL-originating PTX as regular CUDA PTX. But their question about %envN ...
0 votes · 1 answer · 493 views
What does %f, %rd mean in ptx assembly
Hi, I'm new to CUDA programming. I got this piece of assembly code from building a program with OpenCL.
I came to wonder what those numbers and characters mean, such as %f7, %f11, %rd3, %r3, %f, %p. ...
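For orientation (these are nvcc's naming conventions, not something the PTX ISA mandates, and the counts below are made up): nvcc-generated PTX typically declares its virtual register banks like this, and the prefix tells you the register class.

.reg .pred %p<5>;    // predicate registers (branch conditions)
.reg .b16  %rs<4>;   // 16-bit registers
.reg .b32  %r<10>;   // 32-bit integer registers
.reg .b64  %rd<8>;   // 64-bit integer / address registers
.reg .f32  %f<12>;   // 32-bit floating-point registers
.reg .f64  %fd<4>;   // 64-bit floating-point registers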
2 votes · 1 answer · 119 views
How to compile CUDA code that calls one function twice inside one method?
I'm trying to compile this piece of code:
struct foo {
unsigned long long x0;
};
//__device__ __noinline__ foo bar(foo a, foo b){ // << try this
__device__ foo bar(foo a, foo b){
foo r;
...
1 vote · 1 answer · 610 views
Using nvdisasm to generate control flow image of PTX code
I have a single file of CUDA code compiled to intermediate language PTX code, example.ptx. I would be interested to start poking around with this short file, trying to understand how it works.
I don't ...
1 vote · 1 answer · 118 views
Understanding Performance Behavior of Random Writes to Global Memory
I'm running experiments aiming to understand the behavior of random read and write access to global memory.
The following kernel reads from an input vector (groupColumn) with a coalesced access ...
3 votes · 1 answer · 832 views
In asm volatile inline PTX instructions, why also specify "memory" side effects?
Consider the following excerpt from CUDA's Inline PTX Assembly guide (v10.2):
The compiler assumes that an asm() statement has no side effects
except to change the output operands. To ensure that ...
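A minimal sketch of the pattern that guide section describes (illustrative names): an asm() statement that touches memory lists "memory" in its clobber list so the compiler will neither cache values in registers across it nor reorder memory accesses around it.

__device__ void store_u32(unsigned* p, unsigned v) {
    asm volatile("st.global.u32 [%0], %1;" :: "l"(p), "r"(v) : "memory");
}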