1 vote
1 answer
159 views

If I do a memcpy_async on a per-thread basis, everything works fine; see test_memcpy32 below. This code prefetches data within a single warp. I want to expand this so that I can prefetch data in ...
Johan • 77.4k
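The excerpt is cut off, but the technique it describes, a whole warp issuing a single asynchronous copy into shared memory, looks roughly like the sketch below. The kernel name, tile size, and the assumption of one 32-thread warp per block are illustrative, not taken from the question.

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    // All 32 lanes cooperate on one memcpy_async into shared memory, then
    // consume the staged data. Assumes 32 threads per block and n a multiple of 32.
    __global__ void warp_prefetch(const float* __restrict__ src, float* dst, int n)
    {
        constexpr int TILE = 32;
        __shared__ float smem[TILE];

        auto block = cg::this_thread_block();
        auto warp  = cg::tiled_partition<32>(block);

        cg::memcpy_async(warp, smem, src + blockIdx.x * TILE, sizeof(float) * TILE);
        cg::wait(warp);                              // wait for the copy to land

        int i = blockIdx.x * TILE + warp.thread_rank();
        if (i < n) dst[i] = smem[warp.thread_rank()] * 2.0f;
    }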
0 votes
1 answer
172 views

I am trying to share memory between two Vulkan processes (using the same NVIDIA GPU device) using the VK_KHR_external_memory extension on Linux (Ubuntu 22). I create a buffer/device memory (and ...
pettersson
-3 votes
1 answer
115 views

I have been reading the book "Professional CUDA C Programming", and it shows two cases of bank conflicts: a two-way bank conflict and a three-way bank conflict. The figure below shows how the words are mapped ...
kdh • 194
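For readers skimming the list, a minimal sketch of what a two-way conflict looks like in code; the index pattern and kernel name are illustrative, not taken from the book.

    // One warp, stride-2 word indices: lanes 0 and 16 both map to bank 0,
    // lanes 1 and 17 to bank 2, and so on, so each bank that is used gets
    // two requests: a two-way conflict (assuming 32 four-byte banks).
    __global__ void two_way_conflict(float* out)
    {
        __shared__ float smem[64];
        int tid = threadIdx.x;               // 0..31
        smem[tid]      = (float)tid;
        smem[tid + 32] = (float)(tid + 32);
        __syncthreads();

        out[tid] = smem[(tid * 2) % 64];     // conflicting read
    }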
4 votes
1 answer
130 views

I have been trying to understand how data broadcasting works. With this in mind, I have designed two distinct kernels (differing in how they read data from shared memory). I have tried to compare the ...
log0xFF • 43
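As a point of reference, a minimal sketch of the two read patterns usually compared in this context, a broadcast read versus a one-word-per-lane read; these kernels are illustrative, not the asker's.

    __global__ void broadcast_vs_per_lane(const float* in, float* out)
    {
        __shared__ float smem[32];
        int tid = threadIdx.x;               // one warp assumed
        smem[tid] = in[tid];
        __syncthreads();

        float same = smem[0];                // every lane reads the same word:
                                             // broadcast, one transaction
        float own  = smem[tid];              // one distinct bank per lane:
                                             // also conflict-free
        out[tid] = same + own;
    }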
0 votes
1 answer
131 views

Consider the following CUDA kernel that is used in Python via CuPy, from the CuPy docs: add_kernel = cp.RawKernel(r''' extern "C" __global__ void my_add(const float* x1, const float* x2, float*...
Uwe.Schneider
0 votes
1 answer
152 views

I've queried the CUDA device (T1000 SM_75) and picked the values of some specific CUDA device attributes as follows. (Note: this question is a little bit lengthy ☺.) #include <cuda.h> #include &...
sof • 9,767
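The full listing is truncated above; a minimal driver-API sketch of querying shared-memory-related device attributes, in the spirit of the excerpt, might look like the following (attribute selection illustrative, error checks omitted).

    #include <cuda.h>
    #include <cstdio>

    int main()
    {
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);

        int perBlock = 0, perBlockOptin = 0, perSM = 0;
        cuDeviceGetAttribute(&perBlock,
            CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, dev);
        cuDeviceGetAttribute(&perBlockOptin,
            CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN, dev);
        cuDeviceGetAttribute(&perSM,
            CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR, dev);

        printf("shared mem per block: %d, opt-in max: %d, per SM: %d\n",
               perBlock, perBlockOptin, perSM);
        return 0;
    }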
3 votes
0 answers
362 views

I'm working on the render part of Assignment 2 for CMU's 15-418 course, which involves writing a high-performance renderer using CUDA. In my code, each CUDA thread is responsible for computing a single ...
Sunjnn • 51
0 votes
1 answer
353 views

I am writing CUDA C++ code for image filtering. CUDA separates the image data into blocks for parallel processing. For regular pixel-wise processing it is of course fast. However, in image filtering, ...
MeiH • 1,875
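The usual shared-memory remedy for the neighbourhood accesses the excerpt alludes to is to stage a tile plus its halo. A sketch of a 3x3 box filter with clamped borders follows; the tile size and names are illustrative, and a (TILE, TILE) thread block is assumed.

    #define TILE   16
    #define RADIUS 1                                   // 3x3 filter

    __global__ void box3x3(const float* in, float* out, int w, int h)
    {
        __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

        // Cooperatively load the tile plus a one-pixel halo, clamping at borders.
        for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE)
            for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
                int gx = min(max((int)(blockIdx.x * TILE) + dx - RADIUS, 0), w - 1);
                int gy = min(max((int)(blockIdx.y * TILE) + dy - RADIUS, 0), h - 1);
                tile[dy][dx] = in[gy * w + gx];
            }
        __syncthreads();

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < w && y < h) {
            float s = 0.0f;
            for (int dy = -RADIUS; dy <= RADIUS; ++dy)
                for (int dx = -RADIUS; dx <= RADIUS; ++dx)
                    s += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];
            out[y * w + x] = s / 9.0f;
        }
    }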
0 votes
1 answer
120 views

We are using dynamic shared memory in our CUDA kernels. We are setting the size of the shared memory for each kernel using the driver API cuFuncSetAttribute and ...
msedi • 1,815
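A minimal driver-API sketch of the opt-in pattern the excerpt refers to, assuming the CUfunction was loaded elsewhere; the helper name and launch shape are illustrative.

    #include <cuda.h>

    // Opt 'func' in to 'smemBytes' of dynamic shared memory and launch it
    // with that amount; kernels must opt in before exceeding the 48 KB default.
    CUresult launchWithDynamicSmem(CUfunction func, void** args, CUstream stream,
                                   unsigned gridX, unsigned blockX, size_t smemBytes)
    {
        CUresult r = cuFuncSetAttribute(
            func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, (int)smemBytes);
        if (r != CUDA_SUCCESS) return r;

        // The sharedMemBytes launch argument must stay within the limit set above.
        return cuLaunchKernel(func, gridX, 1, 1, blockX, 1, 1,
                              (unsigned)smemBytes, stream, args, nullptr);
    }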
2 votes
0 answers
169 views

I expect a CUDA shared memory bank conflict in the following two situations: (1) accessing successive half floats (2 words) with successive threads; (2) accessing vectorized int4 datatypes by successive ...
fabian • 1,881
0 votes
0 answers
74 views

The following kernel performs the multiplication of the matrices matA and matB and stores the result in the matrix matC (the size of all matrices is N), using a shared memory region with dimensions tiledim ...
Athanasios Margaris
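The excerpt is truncated, but the structure it describes is the standard tiled shared-memory matrix multiply. A sketch with an illustrative tile size, not the asker's code:

    #define TILEDIM 16

    // C = A * B for N x N matrices; one TILEDIM x TILEDIM tile of A and of B
    // is staged in shared memory per iteration of the outer loop.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N)
    {
        __shared__ float sA[TILEDIM][TILEDIM];
        __shared__ float sB[TILEDIM][TILEDIM];

        int row = blockIdx.y * TILEDIM + threadIdx.y;
        int col = blockIdx.x * TILEDIM + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < (N + TILEDIM - 1) / TILEDIM; ++t) {
            int aCol = t * TILEDIM + threadIdx.x;
            int bRow = t * TILEDIM + threadIdx.y;
            sA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
            sB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
            __syncthreads();                          // tile fully loaded

            for (int k = 0; k < TILEDIM; ++k)
                acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
            __syncthreads();                          // done reading before next load
        }
        if (row < N && col < N) C[row * N + col] = acc;
    }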
1 vote
0 answers
145 views

I'm working with different CUDA kernels (gemm3, gemm4, and gemm5) for matrix multiplication: gemm3 is the baseline shared-memory GEMM, gemm4 uses fewer thread blocks in the x dimension, and gemm5 uses fewer blocks in both ...
Worldbuffer
0 votes
1 answer
68 views

I am trying to transpose a matrix. It works as expected for some values and starts crashing with bigger ones, or even between executions of the program. What I am trying to do is split the matrix ...
BrightSoul
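A common working baseline for a shared-memory transpose, including the +1 padding column that avoids bank conflicts on the transposed read; it assumes a (32, 32) thread block and is a sketch, not the asker's code.

    #define TDIM 32

    __global__ void transpose(const float* in, float* out, int width, int height)
    {
        __shared__ float tile[TDIM][TDIM + 1];     // +1 avoids bank conflicts

        int x = blockIdx.x * TDIM + threadIdx.x;
        int y = blockIdx.y * TDIM + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        // Write the transposed tile; the block indices swap roles.
        x = blockIdx.y * TDIM + threadIdx.x;
        y = blockIdx.x * TDIM + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }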
7 votes
1 answer
7k views

From the CUDA Programming Guide: [Warp shuffle functions] exchange a variable between threads within a warp. I understand that this is an alternative to shared memory, thus it's being used for ...
gonidelis • 1,115
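A minimal example of what the shuffle functions buy you: a warp-level sum that never touches shared memory. The kernel and its one-warp-per-block launch assumption are illustrative.

    __global__ void warp_sum(const float* in, float* out)
    {
        // One 32-thread warp per block; each lane starts with one element.
        float v = in[blockIdx.x * 32 + threadIdx.x];

        // Tree reduction in registers: each step adds the value held by the
        // lane 'offset' positions higher, with no shared-memory round trip.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);

        if (threadIdx.x == 0) out[blockIdx.x] = v;   // lane 0 holds the sum
    }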
0 votes
1 answer
132 views

I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to perform a benchmark when a warp (32 threads) accesses 32 integers of 32 bits each in the following 2 ...
Ferdinand Mom
7 votes
3 answers
3k views

I am currently studying CUDA and learned that there is global memory and shared memory. I have checked the CUDA documentation and found that GPUs can access shared memory and global memory using ld.shared/...
Tae • 125
2 votes
0 answers
638 views

I faced a problem when training stylegan3, where the terminal got stuck at "Setting up PyTorch plugin "upfirdn2d_plugin"... ". I have tried all the methods I found, such as reinstall ...
Fu Wenjin
0 votes
1 answer
2k views

In the CUDA programming guide, in the shared memory section, it states that shared memory access by the warp is not serialized but broadcast for reads. However, it doesn't state what happens if the ...
Niteya Shah • 1,824
0 votes
1 answer
2k views

My kernel allocates shared memory for data storage, but an error is reported if I change the size of the shared memory; see the code attached. #include <stdio.h> #include <assert.h> #define ...
Mangoccc
1 vote
2 answers
2k views

I encountered a performance issue where atomicAdd on a shared-memory float is much more expensive than on an int, after profiling with nv-nsight-cu-cli. After checking the generated SASS, I found ...
JGL • 158
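A minimal kernel that sets up the comparison the excerpt describes, one integer and one float atomicAdd on shared memory, is sketched below; it is illustrative, and the cost difference the asker profiled is commonly attributed to the float case being lowered to a compare-and-swap loop on some architectures.

    __global__ void shared_atomic_compare(int* iout, float* fout)
    {
        __shared__ int   icnt;
        __shared__ float fsum;
        if (threadIdx.x == 0) { icnt = 0; fsum = 0.0f; }
        __syncthreads();

        atomicAdd(&icnt, 1);        // integer atomic on shared memory
        atomicAdd(&fsum, 1.0f);     // float atomic on shared memory
        __syncthreads();

        if (threadIdx.x == 0) { *iout = icnt; *fout = fsum; }
    }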
1 vote
1 answer
969 views

I have a kernel where each warp accumulates the sum of a chunk of data. At the end of the calculation, I have a situation where the last lane of each warp has to send data to the first lane of the ...
Elad Maimoni • 4,781
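One common way to do that hand-off is to stage the per-warp value in shared memory, roughly as sketched below; it assumes at most 32 warps per block, and the names are illustrative.

    __global__ void pass_to_next_warp(float* data)
    {
        __shared__ float handoff[32];          // one slot per warp in the block

        int lane = threadIdx.x & 31;
        int warp = threadIdx.x >> 5;
        float v = data[blockIdx.x * blockDim.x + threadIdx.x];

        if (lane == 31) handoff[warp] = v;     // last lane publishes its value
        __syncthreads();

        if (lane == 0 && warp > 0)             // first lane of the next warp
            v += handoff[warp - 1];            // picks it up

        data[blockIdx.x * blockDim.x + threadIdx.x] = v;
    }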
0 votes
1 answer
142 views

My questions arise while reading the last example in B.5. Memory Fence Functions. I understand that the flag in this example marks the final block processing a sum. As I imagine it, if the flag is ...
JGL • 158
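For context, a condensed version of the pattern that guide example builds on, with the per-block work kept trivially simple so the fence and the counter stand out; names are illustrative and this is not the guide's exact code.

    __device__ unsigned int blocksDone = 0;

    __global__ void sum_with_fence(const float* in, volatile float* partial,
                                   float* total, int n)
    {
        __shared__ bool isLastBlockDone;

        if (threadIdx.x == 0) {
            // Serial partial sum of this block's chunk (kept simple on purpose).
            int begin = blockIdx.x * blockDim.x;
            int end   = min(begin + (int)blockDim.x, n);
            float p = 0.0f;
            for (int i = begin; i < end; ++i) p += in[i];
            partial[blockIdx.x] = p;

            // Make the write visible to other blocks *before* the counter
            // says this block is done; that is what the fence is for.
            __threadfence();
            unsigned int done = atomicInc(&blocksDone, gridDim.x);
            isLastBlockDone = (done == gridDim.x - 1);
        }
        __syncthreads();

        // Only the last block to finish sees the flag set and does the final sum.
        if (isLastBlockDone && threadIdx.x == 0) {
            float s = 0.0f;
            for (unsigned int b = 0; b < gridDim.x; ++b) s += partial[b];
            *total = s;
            blocksDone = 0;                     // reset for a later launch
        }
    }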
0 votes
1 answer
220 views

I have a hash table data structure, which uses a linear-probing hash scheme and is designed to be lock-free with CAS. The hash table: constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff; struct OnceLock {...
JGL • 158
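The excerpt cuts off before the probing loop, but the core CAS-claim step of a linear-probing, lock-free insert typically looks like the sketch below; the struct layout, hash function, and the plain (non-atomic) value write are simplifications, not the asker's design.

    #include <cstdint>

    constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff;

    struct Slot { unsigned long long key; uint64_t value; };

    // Claim a slot by CAS-ing the key from HASH_EMPTY to 'key'; if another
    // thread already owns the slot with a different key, probe the next one.
    __device__ bool insert(Slot* table, uint64_t capacity,
                           unsigned long long key, uint64_t value)
    {
        uint64_t slot = key % capacity;                     // toy hash function
        for (uint64_t probes = 0; probes < capacity; ++probes) {
            unsigned long long prev =
                atomicCAS(&table[slot].key, (unsigned long long)HASH_EMPTY, key);
            if (prev == HASH_EMPTY || prev == key) {
                // NOTE: publishing 'value' like this needs extra care (fence
                // or atomics) if readers may race with the insert.
                table[slot].value = value;
                return true;
            }
            slot = (slot + 1) % capacity;                   // linear probing
        }
        return false;                                       // table full
    }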
0 votes
0 answers
290 views

Some CUDA devices support different shared memory bank sizes (4 bytes, 8 bytes); others support just one (typically/always 4 bytes). As I have come to realize, I won't get an error trying to set the ...
einpoklum • 137k
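A small runtime-API sketch of the set-then-read-back check this is heading toward; note that cudaDeviceSetSharedMemConfig is deprecated on recent toolkits and, as the excerpt says, may simply have no effect on devices with a fixed 4-byte bank size rather than returning an error.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        // Request 8-byte banks ...
        cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
        printf("set config: %s\n", cudaGetErrorString(err));

        // ... then read the configuration back to see whether it stuck.
        cudaSharedMemConfig cfg;
        cudaDeviceGetSharedMemConfig(&cfg);
        printf("bank size now: %s\n",
               cfg == cudaSharedMemBankSizeEightByte ? "8 bytes" : "4 bytes");
        return 0;
    }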
3 votes
1 answer
706 views

I'm using inline PTX ld.shared to load data from shared memory: __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory float Csub = 0; As[TY][TX] = A[a + wA * TY + TX]; ...
Yichen • 101
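A self-contained variant of that pattern, assuming a toolkit that provides the __cvta_generic_to_shared intrinsic: ld.shared wants an address in the shared state space, not a generic pointer, so the conversion happens before the inline asm. This is a sketch, not the asker's kernel.

    __global__ void ptx_ld_shared(const float* A, float* out)
    {
        __shared__ float As[32];
        As[threadIdx.x] = A[threadIdx.x];
        __syncthreads();

        // Convert the generic pointer to a 32-bit shared-memory address.
        unsigned saddr = (unsigned)__cvta_generic_to_shared(&As[threadIdx.x]);
        float v;
        asm volatile("ld.shared.f32 %0, [%1];" : "=f"(v) : "r"(saddr));

        out[threadIdx.x] = v;
    }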
