89 questions
0
votes
1
answer
99
views
CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop
I am trying to implement the producer-consumer problem between GPU and CPU, which is required for another project. The GPU requests some data from the CPU via unified memory. The CPU copies that data to a specific location in global ...
-1
votes
2
answers
158
views
Are "cuWhateverAsync" calls with stream handle NULL universally equivalent to "cuWhatever" calls?
The CUDA Driver API has multiple calls with an "asynchronous" variant, e.g. cuMemcpy2D and cuMemcpy2DAsync, with the "asynchronous" variant taking a stream handle - and there are ...
1
vote
1
answer
33
views
Some TensorRT conv layer forward blocked by cudaMemcpyAsync from another thread
See this nsys profile:
I have observed that during the forward pass of some layers in TensorRT execution, a lock is acquired before launching the kernel.
I attempted to determine the specific lock ...
3
votes
0
answers
215
views
Why is using multiple CUDA streams not improving performance as expected?
I am working on optimizing a CUDA application that processes a matrix by updating each row sequentially. The process involves three main kernels:
Kernel 1: Updates the pivot element of the current row....
2
votes
2
answers
818
views
How to Use CUDA Graphs with Interdependent Streams and Dynamic Parameters?
I have a CUDA program with multiple interdependent streams, and I want to convert it to use CUDA graphs to reduce launch overhead and improve performance. My program involves launching three kernels (...
3
votes
0
answers
190
views
Compute and Data transfer not happening concurrently in cuda Streams on Iteration 2
I have written a basic program where a chunk of data is loaded in CPU memory (Pinned), and then I transfer it in chunks to GPUs (Asynchronously), and then do computation on each chunk. So for each ...
0
votes
1
answer
154
views
What are the semantics of CUDA kernel launch priorities?
Starting with CUDA 12.0, one can specify a wider variety of "launch attributes" when launching a kernel, using a CUlaunchConfig structure; and one of the attributes we can place in a launch ...
2
votes
1
answer
208
views
Why am I unable to establish a pipeline when using multiple concurrent streams in CUDA programming?
I wish to construct a pipeline using multiple streams. Below is the code I have written:
using namespace std;
__global__ void vecAdd(float *c, const float *a, const float *b);
void initBuffer(float *...
1
vote
0
answers
398
views
Does a CUDA stream "become active" after execution of a scheduled host function concludes?
The CUDA documentation for scheduling the launch of a host function (cuLaunchHostFunc) says:
Completion of the function does not cause a stream to become active except as described above.
I couldn't ...
0
votes
1
answer
113
views
What does CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC actually allow?
One of the attributes of CUDA memory pools is CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC, described in the doxygen as follows:
Allow reuse of already completed frees when there is no dependency ...
1
vote
1
answer
788
views
Is there a way to block and unblock a CUDA stream arbitrarily?
I need to pause the execution of all calls in a stream from a certain point in one part of the program until another part of the program decides to unpause this stream at an arbitrary time. This is ...
4
votes
2
answers
2k
views
What's the capacity of a CUDA stream (=queue)?
A CUDA stream is a queue of tasks: memory copies, event firing, event waits, kernel launches, callbacks...
But - these queues don't have infinite capacity. In fact, empirically, I find that this limit ...
2
votes
2
answers
2k
views
Using multi streams in cuda graph, the execution order is uncontrolled
I am using the CUDA graph stream capture API to implement a small demo with multiple streams. Following the CUDA Programming Guide, I wrote the complete code. To my knowledge, kernelB should execute ...
2
votes
1
answer
990
views
How can I pause a CUDA stream and then resume it?
Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause a running CUDA kernel with an instruction I put in the host code and resume it with the ...
0
votes
1
answer
717
views
Reusing cudaEvent to serialize multiple streams
Suppose I have a struct:
typedef enum {ON_CPU,ON_GPU,ON_BOTH} memLocation;
typedef struct foo *foo;
struct foo {
cudaEvent_t event;
float *deviceArray;
float *hostArray;
...
0
votes
1
answer
233
views
CUDA cudaMemcpyAsync using single stream to host
I have a single kernel which is filling data into two parameters (dev_out_1 and dev_out_2) using a single stream. I wanted to copy the data back from the device to the host in parallel.
My requirement is to ...
0
votes
1
answer
417
views
Is it possible to manually set the SMs used for one CUDA stream?
By default, a kernel will use all available SMs of the device (if there are enough blocks). However, I now have 2 streams, one compute-intensive and one memory-intensive, and I want to limit the maximal ...
0
votes
1
answer
398
views
Overlapping transfers and kernel executions in CUDA with two loops
I want to overlap data transfers and kernel executions in a form like this:
int numStreams = 3;
int size = 10;
for(int i = 0; i < size; i++) {
cuMemcpyHtoDAsync( _bufferIn1,
...
3
votes
0
answers
1k
views
Execute another model in parallel to a model's forward pass with PyTorch
I am trying to make some changes to the ResNet-18 model in PyTorch to invoke the execution of another auxiliary trained model which takes in the ResNet intermediate layer output at the end of each ...
3
votes
1
answer
890
views
Concurrency of one large kernel with many small kernels and memcopys (CUDA)
I am developing a multi-GPU accelerated flow solver. Currently I am trying to implement communication hiding. That means that, while data is exchanged, the GPU computes the part of the mesh that is not ...
4
votes
1
answer
2k
views
What is the difference between Nvidia Hyper Q and Nvidia Streams?
I always thought that Hyper-Q technology is nothing but streams on the GPU. Later I found I was wrong (am I?). So I did some reading about Hyper-Q and got more confused.
I was going through one ...
-1
votes
1
answer
1k
views
Why operations in two CUDA Streams are not overlapping?
My program is a pipeline, which contains multiple kernels and memcpys. Each task will go through the same pipeline with different input data. The host code will first choose a Channel, an ...
6
votes
1
answer
5k
views
What is the relationship between NVIDIA MPS (Multi-Process Server) and CUDA Streams?
From a glance at the official NVIDIA Multi-Process Server docs, it is unclear to me how it interacts with CUDA streams.
Here's an example:
App 0: issues kernels to logical stream 0;
App 1: issues ...
0
votes
1
answer
393
views
Is cuStreamAddCallback as effective as cuStreamSynchronize in having latest bits of data on host?
The CUDA (driver API) documentation says:
The start of execution of a callback has the same effect as
synchronizing an event recorded in the same stream immediately prior
to the callback. It ...
0
votes
1
answer
600
views
Enqueueing an async copy from a CUDA callback - not permitted?
This program:
#include <string>
#include <stdexcept>
struct buffers_t {
void* host_buffer;
void* device_buffer;
};
void ensure_no_error(std::string message) {
auto status = ...