1 vote
1 answer
159 views

If I do a memcpy_async on a per-thread basis, everything works fine; see test_memcpy32 below. This code prefetches data within a single warp. I want to expand this so that I can prefetch data in ...
Johan • 77.4k
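The excerpt is cut off, but the technique it describes, a whole warp issuing a single asynchronous copy into shared memory, looks roughly like the sketch below. The kernel name, tile size, and the assumption of one 32-thread warp per block are illustrative, not taken from the question.

    #include <cooperative_groups.h>
    #include <cooperative_groups/memcpy_async.h>
    namespace cg = cooperative_groups;

    // All 32 lanes cooperate on one memcpy_async into shared memory, then
    // consume the staged data. Assumes 32 threads per block and n a multiple of 32.
    __global__ void warp_prefetch(const float* __restrict__ src, float* dst, int n)
    {
        constexpr int TILE = 32;
        __shared__ float smem[TILE];

        auto block = cg::this_thread_block();
        auto warp  = cg::tiled_partition<32>(block);

        cg::memcpy_async(warp, smem, src + blockIdx.x * TILE, sizeof(float) * TILE);
        cg::wait(warp);                              // wait for the copy to land

        int i = blockIdx.x * TILE + warp.thread_rank();
        if (i < n) dst[i] = smem[warp.thread_rank()] * 2.0f;
    }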
0 votes
1 answer
172 views

I am trying to share memory between two Vulkan processes (using the same NVIDIA GPU device) using the VK_KHR_external_memory extension on Linux (Ubuntu 22). I create a buffer/device memory (and ...
pettersson
-3 votes
1 answer
115 views

I have been reading the book "Professional CUDA C Programming", and it shows two cases of bank conflicts: a two-way bank conflict and a three-way bank conflict. The figure below shows how the words are mapped ...
kdh • 194
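For readers skimming the list, a minimal sketch of what a two-way conflict looks like in code; the index pattern and kernel name are illustrative, not taken from the book.

    // One warp, stride-2 word indices: lanes 0 and 16 both map to bank 0,
    // lanes 1 and 17 to bank 2, and so on, so each bank that is used gets
    // two requests: a two-way conflict (assuming 32 four-byte banks).
    __global__ void two_way_conflict(float* out)
    {
        __shared__ float smem[64];
        int tid = threadIdx.x;               // 0..31
        smem[tid]      = (float)tid;
        smem[tid + 32] = (float)(tid + 32);
        __syncthreads();

        out[tid] = smem[(tid * 2) % 64];     // conflicting read
    }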
4 votes
1 answer
130 views

I have been trying to understand how data broadcasting works. With this in mind, I have designed two distinct kernels (differing in how they read data from shared memory). I have tried to compare the ...
log0xFF • 43
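As a point of reference, a minimal sketch of the two read patterns usually compared in this context, a broadcast read versus a one-word-per-lane read; these kernels are illustrative, not the asker's.

    __global__ void broadcast_vs_per_lane(const float* in, float* out)
    {
        __shared__ float smem[32];
        int tid = threadIdx.x;               // one warp assumed
        smem[tid] = in[tid];
        __syncthreads();

        float same = smem[0];                // every lane reads the same word:
                                             // broadcast, one transaction
        float own  = smem[tid];              // one distinct bank per lane:
                                             // also conflict-free
        out[tid] = same + own;
    }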
0 votes
1 answer
131 views

Consider the following CUDA kernel that is used in Python via CuPy, from the CuPy docs: add_kernel = cp.RawKernel(r''' extern "C" __global__ void my_add(const float* x1, const float* x2, float*...
Uwe.Schneider
0 votes
1 answer
152 views

I've queried the CUDA device (T1000 SM_75) and picked the values of some specific CUDA device attributes as follows. (Note: this question is a little bit lengthy ☺.) #include <cuda.h> #include &...
sof • 9,767
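The full listing is truncated above; a minimal driver-API sketch of querying shared-memory-related device attributes, in the spirit of the excerpt, might look like the following (attribute selection illustrative, error checks omitted).

    #include <cuda.h>
    #include <cstdio>

    int main()
    {
        cuInit(0);
        CUdevice dev;
        cuDeviceGet(&dev, 0);

        int perBlock = 0, perBlockOptin = 0, perSM = 0;
        cuDeviceGetAttribute(&perBlock,
            CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, dev);
        cuDeviceGetAttribute(&perBlockOptin,
            CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK_OPTIN, dev);
        cuDeviceGetAttribute(&perSM,
            CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_MULTIPROCESSOR, dev);

        printf("shared mem per block: %d, opt-in max: %d, per SM: %d\n",
               perBlock, perBlockOptin, perSM);
        return 0;
    }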
3 votes
0 answers
362 views

I'm working on the render part of Assignment 2 for CMU's 15-418 course, which involves writing a high-performance renderer using CUDA. In my code, each CUDA thread is responsible for computing a single ...
Sunjnn • 51
0 votes
1 answer
353 views

I am writing CUDA C++ code for image filtering. CUDA separates the image data into blocks for parallel processing. For regular pixel-wise processing it is of course fast. However, in image filtering, ...
MeiH • 1,875
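The usual shared-memory remedy for the neighbourhood accesses the excerpt alludes to is to stage a tile plus its halo. A sketch of a 3x3 box filter with clamped borders follows; the tile size and names are illustrative, and a (TILE, TILE) thread block is assumed.

    #define TILE   16
    #define RADIUS 1                                   // 3x3 filter

    __global__ void box3x3(const float* in, float* out, int w, int h)
    {
        __shared__ float tile[TILE + 2 * RADIUS][TILE + 2 * RADIUS];

        // Cooperatively load the tile plus a one-pixel halo, clamping at borders.
        for (int dy = threadIdx.y; dy < TILE + 2 * RADIUS; dy += TILE)
            for (int dx = threadIdx.x; dx < TILE + 2 * RADIUS; dx += TILE) {
                int gx = min(max((int)(blockIdx.x * TILE) + dx - RADIUS, 0), w - 1);
                int gy = min(max((int)(blockIdx.y * TILE) + dy - RADIUS, 0), h - 1);
                tile[dy][dx] = in[gy * w + gx];
            }
        __syncthreads();

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < w && y < h) {
            float s = 0.0f;
            for (int dy = -RADIUS; dy <= RADIUS; ++dy)
                for (int dx = -RADIUS; dx <= RADIUS; ++dx)
                    s += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];
            out[y * w + x] = s / 9.0f;
        }
    }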
0 votes
1 answer
120 views

We are using dynamic shared memory in our CUDA kernels. We are setting the size of the shared memory for each kernel using the driver API cuFuncSetAttribute and ...
msedi • 1,815
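A minimal driver-API sketch of the opt-in pattern the excerpt refers to, assuming the CUfunction was loaded elsewhere; the helper name and launch shape are illustrative.

    #include <cuda.h>

    // Opt 'func' in to 'smemBytes' of dynamic shared memory and launch it
    // with that amount; kernels must opt in before exceeding the 48 KB default.
    CUresult launchWithDynamicSmem(CUfunction func, void** args, CUstream stream,
                                   unsigned gridX, unsigned blockX, size_t smemBytes)
    {
        CUresult r = cuFuncSetAttribute(
            func, CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES, (int)smemBytes);
        if (r != CUDA_SUCCESS) return r;

        // The sharedMemBytes launch argument must stay within the limit set above.
        return cuLaunchKernel(func, gridX, 1, 1, blockX, 1, 1,
                              (unsigned)smemBytes, stream, args, nullptr);
    }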
2 votes
0 answers
169 views

I expect a CUDA shared memory bank conflict in the following two situations: (1) accessing successive half floats (2 words) with successive threads; (2) accessing vectorized int4 datatypes by successive ...
fabian • 1,881
0 votes
0 answers
74 views

The following kernel performs the multiplication of the matrices matA and matB and stores the result in the matrix matC (the size of all matrices is N), using a shared memory region with dimensions tiledim ...
Athanasios Margaris
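The excerpt is truncated, but the structure it describes is the standard tiled shared-memory matrix multiply. A sketch with an illustrative tile size, not the asker's code:

    #define TILEDIM 16

    // C = A * B for N x N matrices; one TILEDIM x TILEDIM tile of A and of B
    // is staged in shared memory per iteration of the outer loop.
    __global__ void matmul_tiled(const float* A, const float* B, float* C, int N)
    {
        __shared__ float sA[TILEDIM][TILEDIM];
        __shared__ float sB[TILEDIM][TILEDIM];

        int row = blockIdx.y * TILEDIM + threadIdx.y;
        int col = blockIdx.x * TILEDIM + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < (N + TILEDIM - 1) / TILEDIM; ++t) {
            int aCol = t * TILEDIM + threadIdx.x;
            int bRow = t * TILEDIM + threadIdx.y;
            sA[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
            sB[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
            __syncthreads();                          // tile fully loaded

            for (int k = 0; k < TILEDIM; ++k)
                acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
            __syncthreads();                          // done reading before next load
        }
        if (row < N && col < N) C[row * N + col] = acc;
    }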
1 vote
0 answers
145 views

I'm working with different CUDA kernels (gemm3, gemm4, and gemm5) for matrix multiplication: gemm3 is the baseline shared-memory GEMM, gemm4 uses fewer thread blocks in the x dimension, and gemm5 uses fewer blocks in both ...
Worldbuffer
0 votes
1 answer
68 views

I am trying to transpose a matrix. It works as expected for some values and starts crashing with bigger ones, or even between executions of the program. What I am trying to do is split the matrix ...
BrightSoul
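A common working baseline for a shared-memory transpose, including the +1 padding column that avoids bank conflicts on the transposed read; it assumes a (32, 32) thread block and is a sketch, not the asker's code.

    #define TDIM 32

    __global__ void transpose(const float* in, float* out, int width, int height)
    {
        __shared__ float tile[TDIM][TDIM + 1];     // +1 avoids bank conflicts

        int x = blockIdx.x * TDIM + threadIdx.x;
        int y = blockIdx.y * TDIM + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];
        __syncthreads();

        // Write the transposed tile; the block indices swap roles.
        x = blockIdx.y * TDIM + threadIdx.x;
        y = blockIdx.x * TDIM + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }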
7 votes
1 answer
7k views

From the CUDA Programming Guide: [Warp shuffle functions] exchange a variable between threads within a warp. I understand that this is an alternative to shared memory, thus it's being used for ...
gonidelis • 1,115
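A minimal example of what the shuffle functions buy you: a warp-level sum that never touches shared memory. The kernel and its one-warp-per-block launch assumption are illustrative.

    __global__ void warp_sum(const float* in, float* out)
    {
        // One 32-thread warp per block; each lane starts with one element.
        float v = in[blockIdx.x * 32 + threadIdx.x];

        // Tree reduction in registers: each step adds the value held by the
        // lane 'offset' positions higher, with no shared-memory round trip.
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);

        if (threadIdx.x == 0) out[blockIdx.x] = v;   // lane 0 holds the sum
    }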
0 votes
1 answer
132 views

I was trying to reproduce a bank conflict scenario (minimal working example here) and decided to perform a benchmark when a warp (32 threads) accesses 32 integers of 32 bits each in the following 2 ...
Ferdinand Mom
7 votes
3 answers
3k views

I am currently studying CUDA and learned that there is global memory and shared memory. I have checked the CUDA documentation and found that GPUs can access shared memory and global memory using ld.shared/...
Tae • 125
2 votes
0 answers
638 views

I faced a problem when training stylegan3, where the terminal got stuck at "Setting up PyTorch plugin "upfirdn2d_plugin"... ". I have tried all the methods I found, such as reinstall ...
Fu Wenjin
0 votes
1 answer
2k views

In the CUDA programming guide, in the shared memory section, it states that shared memory access by the warp is not serialized but broadcast for reads. However, it doesn't state what happens if the ...
Niteya Shah • 1,824
0 votes
1 answer
2k views

My kernel allocates shared memory for data storage, but an error is reported if I change the size of the shared memory; see the code attached. #include <stdio.h> #include <assert.h> #define ...
Mangoccc
1 vote
2 answers
2k views

I encountered a performance issue where atomicAdd on a shared-memory float is much more expensive than on an int, after profiling with nv-nsight-cu-cli. After checking the generated SASS, I found ...
JGL • 158
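A minimal kernel that sets up the comparison the excerpt describes, one integer and one float atomicAdd on shared memory, is sketched below; it is illustrative, and the cost difference the asker profiled is commonly attributed to the float case being lowered to a compare-and-swap loop on some architectures.

    __global__ void shared_atomic_compare(int* iout, float* fout)
    {
        __shared__ int   icnt;
        __shared__ float fsum;
        if (threadIdx.x == 0) { icnt = 0; fsum = 0.0f; }
        __syncthreads();

        atomicAdd(&icnt, 1);        // integer atomic on shared memory
        atomicAdd(&fsum, 1.0f);     // float atomic on shared memory
        __syncthreads();

        if (threadIdx.x == 0) { *iout = icnt; *fout = fsum; }
    }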
1 vote
1 answer
969 views

I have a kernel where each warp accumulates the sum of a chunk of data. At the end of the calculation, I have a situation where the last lane of each warp has to send data to the first lane of the ...
Elad Maimoni • 4,781
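One common way to do that hand-off is to stage the per-warp value in shared memory, roughly as sketched below; it assumes at most 32 warps per block, and the names are illustrative.

    __global__ void pass_to_next_warp(float* data)
    {
        __shared__ float handoff[32];          // one slot per warp in the block

        int lane = threadIdx.x & 31;
        int warp = threadIdx.x >> 5;
        float v = data[blockIdx.x * blockDim.x + threadIdx.x];

        if (lane == 31) handoff[warp] = v;     // last lane publishes its value
        __syncthreads();

        if (lane == 0 && warp > 0)             // first lane of the next warp
            v += handoff[warp - 1];            // picks it up

        data[blockIdx.x * blockDim.x + threadIdx.x] = v;
    }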
0 votes
1 answer
142 views

My questions arise while reading the last example in B.5. Memory Fence Functions. I understand that the flag in this example marks the final block processing a sum. As I imagine it, if the flag is ...
JGL • 158
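For context, a condensed version of the pattern that guide example builds on, with the per-block work kept trivially simple so the fence and the counter stand out; names are illustrative and this is not the guide's exact code.

    __device__ unsigned int blocksDone = 0;

    __global__ void sum_with_fence(const float* in, volatile float* partial,
                                   float* total, int n)
    {
        __shared__ bool isLastBlockDone;

        if (threadIdx.x == 0) {
            // Serial partial sum of this block's chunk (kept simple on purpose).
            int begin = blockIdx.x * blockDim.x;
            int end   = min(begin + (int)blockDim.x, n);
            float p = 0.0f;
            for (int i = begin; i < end; ++i) p += in[i];
            partial[blockIdx.x] = p;

            // Make the write visible to other blocks *before* the counter
            // says this block is done; that is what the fence is for.
            __threadfence();
            unsigned int done = atomicInc(&blocksDone, gridDim.x);
            isLastBlockDone = (done == gridDim.x - 1);
        }
        __syncthreads();

        // Only the last block to finish sees the flag set and does the final sum.
        if (isLastBlockDone && threadIdx.x == 0) {
            float s = 0.0f;
            for (unsigned int b = 0; b < gridDim.x; ++b) s += partial[b];
            *total = s;
            blocksDone = 0;                     // reset for a later launch
        }
    }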
0 votes
1 answer
220 views

I have a hash table data structure, which uses a linear-probing hash scheme and is designed to be lock-free with CAS. The hash table: constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff; struct OnceLock {...
JGL • 158
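The excerpt cuts off before the probing loop, but the core CAS-claim step of a linear-probing, lock-free insert typically looks like the sketch below; the struct layout, hash function, and the plain (non-atomic) value write are simplifications, not the asker's design.

    #include <cstdint>

    constexpr uint64_t HASH_EMPTY = 0xffffffffffffffff;

    struct Slot { unsigned long long key; uint64_t value; };

    // Claim a slot by CAS-ing the key from HASH_EMPTY to 'key'; if another
    // thread already owns the slot with a different key, probe the next one.
    __device__ bool insert(Slot* table, uint64_t capacity,
                           unsigned long long key, uint64_t value)
    {
        uint64_t slot = key % capacity;                     // toy hash function
        for (uint64_t probes = 0; probes < capacity; ++probes) {
            unsigned long long prev =
                atomicCAS(&table[slot].key, (unsigned long long)HASH_EMPTY, key);
            if (prev == HASH_EMPTY || prev == key) {
                // NOTE: publishing 'value' like this needs extra care (fence
                // or atomics) if readers may race with the insert.
                table[slot].value = value;
                return true;
            }
            slot = (slot + 1) % capacity;                   // linear probing
        }
        return false;                                       // table full
    }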
0 votes
0 answers
290 views

Some CUDA devices support different shared memory bank sizes (4 bytes, 8 bytes); others support just one (typically/always 4 bytes). As I have come to realize, I won't get an error trying to set the ...
einpoklum • 137k
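A small runtime-API sketch of the set-then-read-back check this is heading toward; note that cudaDeviceSetSharedMemConfig is deprecated on recent toolkits and, as the excerpt says, may simply have no effect on devices with a fixed 4-byte bank size rather than returning an error.

    #include <cuda_runtime.h>
    #include <cstdio>

    int main()
    {
        // Request 8-byte banks ...
        cudaError_t err = cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);
        printf("set config: %s\n", cudaGetErrorString(err));

        // ... then read the configuration back to see whether it stuck.
        cudaSharedMemConfig cfg;
        cudaDeviceGetSharedMemConfig(&cfg);
        printf("bank size now: %s\n",
               cfg == cudaSharedMemBankSizeEightByte ? "8 bytes" : "4 bytes");
        return 0;
    }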
3 votes
1 answer
706 views

I'm using inline PTX ld.shared to load data from shared memory: __shared__ float As[BLOCK_SIZE][BLOCK_SIZE]; //declare a buffer in shared memory float Csub = 0; As[TY][TX] = A[a + wA * TY + TX]; ...
Yichen • 101
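A self-contained variant of that pattern, assuming a toolkit that provides the __cvta_generic_to_shared intrinsic: ld.shared wants an address in the shared state space, not a generic pointer, so the conversion happens before the inline asm. This is a sketch, not the asker's kernel.

    __global__ void ptx_ld_shared(const float* A, float* out)
    {
        __shared__ float As[32];
        As[threadIdx.x] = A[threadIdx.x];
        __syncthreads();

        // Convert the generic pointer to a 32-bit shared-memory address.
        unsigned saddr = (unsigned)__cvta_generic_to_shared(&As[threadIdx.x]);
        float v;
        asm volatile("ld.shared.f32 %0, [%1];" : "=f"(v) : "r"(saddr));

        out[threadIdx.x] = v;
    }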
