360 questions
1 vote · 1 answer · 159 views
memcpy_async does not work with pipeline roles
If I do a memcpy_async on a per-thread basis, everything works fine; see the test_memcpy32 below.
This code prefetches data within a single warp.
I want to expand this, so that I can prefetch data in ...
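For context, a minimal sketch of the thread-scoped cuda::pipeline pattern this question starts from (an illustrative assumption, not the asker's test_memcpy32):

#include <cuda/pipeline>

__global__ void prefetch_per_thread(const float* __restrict__ in, float* out, int n)
{
    __shared__ float smem[256];
    // Thread-scoped pipeline: each thread stages its own element asynchronously.
    auto pipe = cuda::make_pipeline();
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        pipe.producer_acquire();
        cuda::memcpy_async(&smem[threadIdx.x], &in[i], sizeof(float), pipe);
        pipe.producer_commit();
        pipe.consumer_wait();              // wait for this thread's copy to land
        out[i] = smem[threadIdx.x] * 2.0f;
        pipe.consumer_release();
    }
}

Expanding beyond a single thread is where pipeline roles (producer/consumer partitioning of a group) come in, which is what the question goes on to ask about.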
0 votes · 1 answer · 172 views
How to properly use VK_KHR_external_memory for sharing memory between two processes using Vulkan API
I am trying to share memory between two Vulkan processes (using the same NVIDIA GPU device) using the VK_KHR_external_memory extension on Linux (Ubuntu 22). I create a buffer/device memory (and ...
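For reference, a trimmed host-side C++ sketch of the opaque-FD handshake this extension family enables (an illustrative assumption, not the asker's code; error handling, the vkGetDeviceProcAddr lookup of vkGetMemoryFdKHR, and the matching VkExternalMemoryBufferCreateInfo on the buffer are omitted):

#include <vulkan/vulkan.h>

// Exporting process: mark the allocation exportable, then extract an FD.
int exportMemoryFd(VkDevice device, VkDeviceSize size,
                   uint32_t memoryTypeIndex, VkDeviceMemory* memory)
{
    VkExportMemoryAllocateInfo exportInfo{};
    exportInfo.sType = VK_STRUCTURE_TYPE_EXPORT_MEMORY_ALLOCATE_INFO;
    exportInfo.handleTypes = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT;

    VkMemoryAllocateInfo allocInfo{};
    allocInfo.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
    allocInfo.pNext = &exportInfo;
    allocInfo.allocationSize = size;
    allocInfo.memoryTypeIndex = memoryTypeIndex;
    vkAllocateMemory(device, &allocInfo, nullptr, memory);

    VkMemoryGetFdInfoKHR getFd{};
    getFd.sType = VK_STRUCTURE_TYPE_MEMORY_GET_FD_INFO_KHR;
    getFd.memory = *memory;
    getFd.handleType = VK_EXTERNAL_MEMORY_HANDLE_TYPE_OPAQUE_FD_BIT;
    int fd = -1;
    vkGetMemoryFdKHR(device, &getFd, &fd); // send fd to the second process (e.g. Unix socket)
    return fd;
}

// Importing process: chain VkImportMemoryFdInfoKHR (handleType + received fd)
// into its own vkAllocateMemory; ownership of the fd transfers to Vulkan.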
-3 votes · 1 answer · 115 views
N-way bank conflict on GPU shared memory in 64-bit mode and access order across words
I have been reading the book "Professional CUDA C Programming", and it shows two cases of bank conflicts:
Two-way bank conflict
Three-way bank conflict
The figure below shows how the words are mapped ...
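As a reference point, a small sketch (my own illustration, not the book's figure) contrasting a conflicting and a conflict-free stride under the usual 4-byte rule, bank = word_index % 32:

__global__ void stride_demo(float* out)
{
    __shared__ float s[96];
    int t = threadIdx.x;                       // assume a single warp of 32 threads
    s[t] = s[t + 32] = s[t + 64] = (float)t;
    __syncthreads();
    float two_way  = s[2 * t];   // banks 2t % 32: threads t and t+16 collide -> two-way conflict
    float no_confl = s[3 * t];   // banks 3t % 32: 3 is odd, all 32 banks distinct -> no conflict
    out[t] = two_way + no_confl;
}

In the 64-bit mode the question asks about (cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte) on Kepler-era parts), the bank of a word is derived from 8-byte units instead, which changes which strides collide.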
4 votes · 1 answer · 130 views
data broadcasting from shared memory bank
I have been trying to understand how data broadcasting works. With this in mind, I have designed two distinct kernels (differing in how they read data from shared memory). I have tried to compare the ...
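For orientation, a minimal sketch of the broadcast case (my example, not the asker's two kernels): when every thread of a warp reads the same 32-bit word, the hardware serves all of them with a single transaction.

__global__ void broadcast_demo(const float* in, float* out)
{
    __shared__ float s[32];
    int t = threadIdx.x;         // one warp of 32 threads
    s[t] = in[t];
    __syncthreads();
    float same_word = s[0];      // all 32 threads read one word: broadcast, 1 transaction
    float own_word  = s[t];      // one word per bank: also a single conflict-free transaction
    out[t] = same_word + own_word;
}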
0 votes · 1 answer · 131 views
Raw kernel with dynamically allocated shared memory
Consider the following CUDA kernel, used in Python via CuPy, from the CuPy docs:
add_kernel = cp.RawKernel(r'''
extern "C" __global__
void my_add(const float* x1, const float* x2, float*...
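The docs' snippet is cut off above; as a hedged sketch of the idiom the question is about (my names, not necessarily the docs' exact code), dynamic shared memory in a raw kernel is declared with extern __shared__ and sized at launch time:

extern "C" __global__
void my_add_shared(const float* x1, const float* x2, float* y, int n)
{
    extern __shared__ float tile[];           // size fixed at launch, not at compile time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = x1[i] + x2[i];
        y[i] = tile[threadIdx.x];
    }
}

On the Python side, CuPy's RawKernel call accepts the byte count through its shared_mem keyword, e.g. kernel(grid, block, args, shared_mem=nbytes).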
0 votes · 1 answer · 152 views
Estimated transactions on coalesced memory accesses
I've queried the CUDA device (T1000 SM_75) and picked the values of some specific CUDA device attributes as follows. (Note: this question is a little bit lengthy ☺.)
#include <cuda.h>
#include <...
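For reference, a minimal driver-API sketch of the kind of attribute query the question opens with (the two attributes chosen here are illustrative assumptions, not the asker's list):

#include <cuda.h>
#include <cstdio>

int main()
{
    CUdevice dev;
    cuInit(0);
    cuDeviceGet(&dev, 0);
    int l2CacheSize = 0, busWidth = 0;
    cuDeviceGetAttribute(&l2CacheSize, CU_DEVICE_ATTRIBUTE_L2_CACHE_SIZE, dev);
    cuDeviceGetAttribute(&busWidth, CU_DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH, dev);
    printf("L2 cache: %d bytes, memory bus: %d bits\n", l2CacheSize, busWidth);
    return 0;
}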
3 votes · 0 answers · 362 views
Bank Conflict Issue in CUDA Shared Memory Access
I'm working on the render part of Assignment 2 for CMU's 15-418 course, which involves writing a high-performance renderer in CUDA. In my code, each CUDA thread is responsible for computing a single ...
0 votes · 1 answer · 353 views
Correct way of using cuda __shared__ memory for image filtering
I am writing CUDA C++ code for image filtering. CUDA separates the image data into blocks for parallel processing. For regular pixel-wise processing it is of course fast. However, in image filtering, ...
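The standard answer to this kind of question is a tiled load with a halo: each block stages its tile plus a border of neighboring pixels into shared memory before filtering. A minimal sketch (a 3x3 box filter with clamp-to-edge, my assumptions, launched with an 18x18 block):

#define TILE 16
#define R    1                                  // filter radius (3x3 kernel)

__global__ void blur3x3(const float* in, float* out, int w, int h)
{
    __shared__ float tile[TILE + 2 * R][TILE + 2 * R];
    int x = blockIdx.x * TILE + threadIdx.x - R;
    int y = blockIdx.y * TILE + threadIdx.y - R;
    // Clamp-to-edge load: blockDim is (TILE+2R)^2, so the halo is covered too.
    int cx = min(max(x, 0), w - 1);
    int cy = min(max(y, 0), h - 1);
    tile[threadIdx.y][threadIdx.x] = in[cy * w + cx];
    __syncthreads();
    // Only the interior threads produce output pixels.
    if (threadIdx.x >= R && threadIdx.x < TILE + R &&
        threadIdx.y >= R && threadIdx.y < TILE + R && x < w && y < h) {
        float sum = 0.f;
        for (int dy = -R; dy <= R; ++dy)
            for (int dx = -R; dx <= R; ++dx)
                sum += tile[threadIdx.y + dy][threadIdx.x + dx];
        out[y * w + x] = sum / 9.f;
    }
}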
0 votes · 1 answer · 120 views
What is the difference between dynamic shared memory as a kernel attribute and as a kernel argument in CUDA?
We are using dynamic shared memory in our CUDA kernels. We set the size of the shared memory for each kernel using the driver API cuFuncSetAttribute and ...
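The short version, as a driver-API sketch (my wrapper, illustrative sizes): the attribute raises the opt-in limit for dynamic shared memory, which is required once you go above the 48 KB default, while the launch argument is the amount actually reserved for that launch.

#include <cuda.h>

void launch_with_dynamic_smem(CUfunction func, CUstream stream,
                              void** params, unsigned gridX, unsigned blockX)
{
    size_t bytes = 64 * 1024;   // assumed example size, above the 48 KB default
    // Attribute: raises the per-kernel *cap*; it reserves nothing by itself.
    cuFuncSetAttribute(func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, (int)bytes);
    // Launch argument: the dynamic shared memory actually allocated this launch.
    cuLaunchKernel(func, gridX, 1, 1, blockX, 1, 1,
                   (unsigned)bytes, stream, params, nullptr);
}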
2 votes · 0 answers · 169 views
Why is there no Shared Memory Bank conflict when loading consecutive half floats or vectorized int4?
I expect a CUDA shared memory bank conflict in the following two situations:
Accessing successive half floats (2 words) with successive threads
Accessing vectorized int4 data types by successive ...
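The usual explanation (hedged; my sketch, not the asker's benchmark) is that same-word accesses are merged rather than counted as conflicts, and wide requests are split into conflict-free phases:

#include <cuda_fp16.h>

__global__ void wide_access(const __half* h_in, const int4* v_in,
                            __half* h_out, int4* v_out)
{
    __shared__ __half h[32];
    __shared__ int4   v[32];
    int t = threadIdx.x;         // one warp of 32 threads
    h[t] = h_in[t];
    v[t] = v_in[t];
    __syncthreads();
    // Threads 2k and 2k+1 read the two halves of one 32-bit word; same-word
    // accesses are merged, so no bank conflict is counted.
    h_out[t] = h[t];
    // A 128-bit warp request is issued in phases of 8 threads; 8 x 16 bytes
    // covers each of the 32 banks exactly once per phase, so it is conflict-free.
    v_out[t] = v[t];
}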
0 votes · 0 answers · 74 views
Why does this code that uses dynamically allocated shared memory in CUDA not work? [duplicate]
The following kernel performs the multiplication of the matrices matA and matB and stores the result in the matrix matC (the size of all matrices is N), using a shared memory region with dimensions tiledim ...
1 vote · 0 answers · 145 views
Understanding the Reduction in Bank Conflicts in CUDA Kernels
I'm working with different CUDA kernels (gemm3, gemm4, and gemm5) for matrix multiplication:
gemm3: baseline of shared memory GEMM
gemm4: fewer thread blocks in the x dimension
gemm5: fewer blocks in both ...
0 votes · 1 answer · 68 views
CUDA transpose kernel fails randomly
I am trying to transpose a matrix. It works as expected for some values but starts crashing with bigger ones, or even between executions of the program.
What I am trying to do is split the matrix ...
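For comparison, the canonical shared-memory tile transpose with bounds checks, which handles sizes that are not multiples of the tile (my sketch, assuming a 32x32 block; the +1 padding also sidesteps bank conflicts):

#define TDIM 32

__global__ void transpose(const float* in, float* out, int w, int h)
{
    __shared__ float tile[TDIM][TDIM + 1];      // +1 column pad avoids bank conflicts
    int x = blockIdx.x * TDIM + threadIdx.x;
    int y = blockIdx.y * TDIM + threadIdx.y;
    if (x < w && y < h)                         // guard partial edge tiles
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];
    __syncthreads();
    x = blockIdx.y * TDIM + threadIdx.x;        // transposed block origin
    y = blockIdx.x * TDIM + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y];
}

Crashes that appear only for some sizes, or intermittently, are often exactly these missing edge guards.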
7 votes · 1 answer · 7k views
What is warp shuffling in CUDA and why is it useful?
From the CUDA Programming Guide:
[Warp shuffle functions] exchange a variable between threads within a warp.
I understand that this is an alternative to shared memory, thus it's being used for ...
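As a concrete illustration (my example): a warp-wide sum that never touches shared memory, exchanging values directly between registers of different lanes.

__global__ void warp_reduce(const float* in, float* out)
{
    float v = in[threadIdx.x];     // assume a single warp of 32 threads
    // Tree reduction entirely in registers: no shared memory and no
    // __syncthreads(), just direct lane-to-lane register exchange.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);
    if (threadIdx.x == 0)
        *out = v;                  // lane 0 ends up holding the warp's sum
}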
0 votes · 1 answer · 132 views
CUDA shared memory bank conflict unexpected timing
I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to run a benchmark in which a warp (32 threads) accesses 32 integers of 32 bits each in the following 2 ...
7 votes · 3 answers · 3k views
In CUDA, what instruction is used to load data from global memory to shared memory?
I am currently studying CUDA and learned that there are global memory and shared memory.
I have checked the CUDA document and found that GPUs can access shared memory and global memory using ld.shared/...
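The compiled instruction depends on the target and on how the copy is written. A sketch of the Ampere-era asynchronous path (my example, assuming sm_80 or newer):

#include <cuda_pipeline.h>

__global__ void staged_copy(const float* __restrict__ in, float* out)
{
    __shared__ float s[256];                    // assume blockDim.x == 256
    // On sm_80+ this lowers to cp.async (LDGSTS in SASS): global -> shared
    // without a round trip through registers. Older targets fall back to an
    // ld.global / st.shared pair through a register.
    __pipeline_memcpy_async(&s[threadIdx.x], &in[threadIdx.x], sizeof(float));
    __pipeline_commit();
    __pipeline_wait_prior(0);
    __syncthreads();
    out[threadIdx.x] = s[threadIdx.x];
}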
2 votes · 0 answers · 638 views
Problem with training stylegan3 - got stuck on Setting up PyTorch plugin "upfirdn2d_plugin"
I faced a problem training stylegan3 where the terminal got stuck at "Setting up PyTorch plugin 'upfirdn2d_plugin'...". I have tried all the methods I found, such as reinstall ...
0 votes · 1 answer · 2k views
Does CUDA broadcast shared memory to all threads in a block without a bank conflict?
In the CUDA programming guide, in the shared memory section, it states that shared memory access by a warp is not serialized but broadcast for reads.
However, it doesn't state what happens if the ...
0 votes · 1 answer · 2k views
CUDA misaligned address for a reused shared memory block
My kernel allocates shared memory for data storage, but a bug is reported if I change the size of the shared memory; see the code attached.
#include <stdio.h>
#include <assert.h>
#define ...
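A common cause of this error is carving one dynamic shared allocation into differently typed pieces. A hedged sketch of the alignment-safe ordering (my names, and an assumption about the bug, not the asker's confirmed code):

__global__ void partitioned(int n_d, int n_f)
{
    extern __shared__ unsigned char smem[];
    // Place the most strictly aligned type first. If the floats came first and
    // n_f were odd, the double pointer would be only 4-byte aligned, and the
    // 8-byte loads would fault with "misaligned address" for some sizes.
    double* d = reinterpret_cast<double*>(smem);     // 8-byte-aligned base
    float*  f = reinterpret_cast<float*>(d + n_d);   // floats placed after the doubles
    if (threadIdx.x < n_d) d[threadIdx.x] = 0.0;
    if (threadIdx.x < n_f) f[threadIdx.x] = 0.0f;
}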
1 vote · 2 answers · 2k views
Shared Memory's atomicAdd with int and float have different SASS
I encountered a performance issue where shared memory's atomicAdd on float is much more expensive than on int, after profiling with nv-nsight-cu-cli.
After checking the generated SASS, I found ...
1 vote · 1 answer · 969 views
Can a CUDA warp communicate with a different warp without using shared memory?
I have a kernel where each warp accumulates the sum of a chunk of data.
At the end of the calculation, I have a situation where the last lane of each warp has to send data to the first lane of the ...
0 votes · 1 answer · 142 views
Thread synchronization necessity on a `volatile __shared__` flag
My questions arise while reading the last example of B.5. Memory Fence Functions.
I understand that the flag in this example marks the final block processing the sum.
As I imagine it, if the flag is ...
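For readers without the guide at hand, the pattern that section describes, reconstructed from memory (details may differ from the guide's exact listing; the partial-sum code is elided):

__device__ unsigned int count = 0;

__global__ void final_block_pattern(float* block_out)
{
    __shared__ bool isLastBlockDone;
    // ... each block writes its partial sum to block_out[blockIdx.x] ...
    __threadfence();   // make this block's partial visible device-wide first
    if (threadIdx.x == 0) {
        // atomicInc wraps back to 0 at gridDim.x, so the last arriving block
        // is the one that reads gridDim.x - 1.
        unsigned int ticket = atomicInc(&count, gridDim.x);
        isLastBlockDone = (ticket == gridDim.x - 1);
    }
    __syncthreads();   // every thread in the block now sees the flag
    if (isLastBlockDone) {
        // only the final block runs this: reduce block_out[0..gridDim.x-1]
    }
}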
0 votes · 1 answer · 220 views
racecheck error from a data structure in shared memory
I have a hash table data structure, which uses a linear-probing hash scheme and is designed to be lock-free with CAS.
The hash table
constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff;
struct OnceLock {...
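For reference, a minimal sketch of the CAS-based linear-probing insert such a table typically uses (my reconstruction built on the question's HASH_EMPTY sentinel, not the asker's actual OnceLock code):

#include <cstdint>

__device__ bool insert(uint64_t* table, uint32_t capacity, uint64_t key)
{
    uint32_t slot = key % capacity;
    for (uint32_t probe = 0; probe < capacity; ++probe) {
        // Atomically claim an empty slot; atomicCAS returns the previous value.
        uint64_t prev = atomicCAS(reinterpret_cast<unsigned long long*>(&table[slot]),
                                  (unsigned long long)HASH_EMPTY,
                                  (unsigned long long)key);
        if (prev == HASH_EMPTY || prev == key)
            return true;                   // claimed the slot, or key already present
        slot = (slot + 1) % capacity;      // linear probing: try the next slot
    }
    return false;                          // table is full
}

Note that racecheck instruments __shared__ memory specifically, and lock-free structures like this can trigger hazard reports even when the CAS protocol is in fact correct.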
0 votes · 0 answers · 290 views
How can I tell whether a CUDA device has a fixed shared memory bank size?
Some CUDA devices support different shared memory bank sizes (4 bytes, 8 bytes); others support just one (typically/always 4 bytes).
As I have come to realize, I won't get an error trying to set the ...
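A plausible probe, assuming the runtime reports the effective configuration back through cudaDeviceGetSharedMemConfig (both calls are deprecated on recent toolkits; on fixed-bank devices the set request is silently ignored rather than failing):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Request 8-byte banks, then read the config back to see whether the
    // device actually honored the change.
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    printf("eight-byte banks %s\n",
           cfg == cudaSharedMemBankSizeEightByte ? "supported" : "not supported");
    return 0;
}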
3 votes · 1 answer · 706 views
CUDA inline PTX ld.shared runs into cudaErrorIllegalAddress error
I'm using inline PTX ld.shared to load data from shared memory:
__shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory
float Csub = 0;
As[TY][TX] = A[a + wA * TY + TX]; ...
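One frequent cause of this error is feeding a generic pointer straight into ld.shared. A hedged sketch of the explicit conversion that avoids the illegal address (my example, assuming that is the issue here):

__global__ void ptx_load(const float* A, float* out)
{
    __shared__ float As[16][16];
    int tx = threadIdx.x, ty = threadIdx.y;   // assume a 16x16 block
    As[ty][tx] = A[ty * 16 + tx];
    __syncthreads();
    // ld.shared expects an address in the shared window, not a generic
    // pointer; convert it explicitly before the inline PTX load.
    unsigned saddr = (unsigned)__cvta_generic_to_shared(&As[ty][tx]);
    float v;
    asm volatile("ld.shared.f32 %0, [%1];" : "=f"(v) : "r"(saddr));
    out[ty * 16 + tx] = v;
}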