0 votes
1 answer
99 views

I am trying to implement the producer-consumer problem between a GPU and a CPU; it is required for another project. The GPU requests some data from the CPU via Unified Memory, and the CPU copies that data to a specific location in global ...
asked by Chinmaya Bhat K K
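A minimal sketch of one way such a handshake can look, assuming a system with concurrentManagedAccess; the flag and buffer names are made up and this is not the asker's code:

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void consumer(volatile int* request_flag,
                         volatile int* ready_flag,
                         const float* payload)
{
    *request_flag = 1;              // ask the CPU for data
    __threadfence_system();
    while (*ready_flag == 0) { }    // spin until the CPU signals completion
    __threadfence_system();
    printf("GPU saw payload[0] = %f\n", payload[0]);
}

int main()
{
    int*   request_flag;
    int*   ready_flag;
    float* payload;
    cudaMallocManaged(&request_flag, sizeof(int));
    cudaMallocManaged(&ready_flag,   sizeof(int));
    cudaMallocManaged(&payload,      4 * sizeof(float));
    *request_flag = 0;
    *ready_flag   = 0;

    consumer<<<1, 1>>>(request_flag, ready_flag, payload);

    // Host-side producer: wait for the request, fill the buffer, signal.
    while (*(volatile int*)request_flag == 0) { }
    for (int i = 0; i < 4; ++i) payload[i] = 42.0f + i;
    *(volatile int*)ready_flag = 1;

    cudaDeviceSynchronize();
    cudaFree(payload);
    cudaFree(ready_flag);
    cudaFree(request_flag);
    return 0;
}
```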
-1 votes
2 answers
158 views

The CUDA Driver API has multiple calls with an "asynchronous" variant, e.g. cuMemcpy2D and cuMemcpy2DAsync, with the "asynchronous" variant taking a stream handle - and there are ...
asked by einpoklum (rep 137k)
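For reference, a hedged sketch of how the two driver-API variants are typically called; it assumes a current context, placeholder buffers, and omits error checking:

```
#include <cuda.h>

void copy_rows(CUdeviceptr dptr, void* hbuf, size_t dpitch,
               size_t width_bytes, size_t height, CUstream stream)
{
    CUDA_MEMCPY2D desc = {};
    desc.srcMemoryType = CU_MEMORYTYPE_HOST;
    desc.srcHost       = hbuf;
    desc.srcPitch      = width_bytes;   // tightly packed host rows
    desc.dstMemoryType = CU_MEMORYTYPE_DEVICE;
    desc.dstDevice     = dptr;
    desc.dstPitch      = dpitch;
    desc.WidthInBytes  = width_bytes;
    desc.Height        = height;

    // Synchronous variant: blocks the host (for pageable source memory).
    cuMemcpy2D(&desc);

    // Asynchronous variant: returns immediately, the copy is ordered within
    // `stream`; the host buffer should be page-locked for true asynchrony.
    cuMemcpy2DAsync(&desc, stream);
}
```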
1 vote
1 answer
33 views

See this nsys profile: I have observed that during the forward pass of some layers in TensorRT execution, a lock is acquired before launching the kernel. I attempted to determine the specific lock ...
asked by user23864711
3 votes
0 answers
215 views

I am working on optimizing a CUDA application that processes a matrix by updating each row sequentially. The process involves three main kernels: Kernel 1: Updates the pivot element of the current row....
asked by Photos (rep 61)
2 votes
2 answers
818 views

I have a CUDA program with multiple interdependent streams, and I want to convert it to use CUDA graphs to reduce launch overhead and improve performance. My program involves launching three kernels (...
asked by Photos (rep 61)
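A minimal stream-capture sketch of the conversion being asked about, with placeholder kernels rather than the asker's program; the CUDA 12.x cudaGraphInstantiate signature is assumed:

```
#include <cuda_runtime.h>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the existing launch sequence into a graph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    kernelA<<<1, 1, 0, stream>>>();
    kernelB<<<1, 1, 0, stream>>>();
    kernelC<<<1, 1, 0, stream>>>();
    cudaStreamEndCapture(stream, &graph);

    // Instantiate once (CUDA 12.x signature), then relaunch cheaply.
    cudaGraphExec_t graphExec;
    cudaGraphInstantiate(&graphExec, graph, 0);
    for (int i = 0; i < 100; ++i)
        cudaGraphLaunch(graphExec, stream);
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(graphExec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    return 0;
}
```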
3 votes
0 answers
190 views

I have written a basic program in which a block of data is loaded into pinned CPU memory, transferred to the GPUs asynchronously in chunks, and computation is then performed on each chunk. So for each ...
asked by Lokesh (rep 31)
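The pattern described is the classic chunked copy/compute overlap; a self-contained sketch with made-up sizes and a placeholder kernel might look like this:

```
#include <cuda_runtime.h>

__global__ void process(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int nChunks   = 4;
    const int chunkSize = 1 << 20;

    float* hData;
    cudaMallocHost(&hData, nChunks * chunkSize * sizeof(float));  // pinned
    float* dData;
    cudaMalloc(&dData, nChunks * chunkSize * sizeof(float));

    cudaStream_t streams[nChunks];
    for (int c = 0; c < nChunks; ++c) cudaStreamCreate(&streams[c]);

    // One stream per chunk: H2D copy, compute, and D2H copy can overlap
    // across chunks because the host buffer is page-locked.
    for (int c = 0; c < nChunks; ++c) {
        size_t off = (size_t)c * chunkSize;
        cudaMemcpyAsync(dData + off, hData + off, chunkSize * sizeof(float),
                        cudaMemcpyHostToDevice, streams[c]);
        process<<<(chunkSize + 255) / 256, 256, 0, streams[c]>>>(dData + off,
                                                                 chunkSize);
        cudaMemcpyAsync(hData + off, dData + off, chunkSize * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[c]);
    }

    cudaDeviceSynchronize();
    for (int c = 0; c < nChunks; ++c) cudaStreamDestroy(streams[c]);
    cudaFreeHost(hData);
    cudaFree(dData);
    return 0;
}
```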
0 votes
1 answer
154 views

Starting with CUDA 12.0, one can specify a wider variety of "launch attributes", when launching a kernel, using a CUlaunchConfig structure; and one of the attributes we can place in a launch ...
asked by einpoklum (rep 137k)
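As context, a hedged sketch of launching through cuLaunchKernelEx with a single launch attribute (a priority, chosen only for illustration); `func` is assumed to come from cuModuleGetFunction and the kernel is assumed to take no parameters:

```
#include <cuda.h>

CUresult launch_with_priority(CUfunction func, CUstream stream, int priority)
{
    // One attribute as an example: a per-launch priority.
    CUlaunchAttribute attr = {};
    attr.id             = CU_LAUNCH_ATTRIBUTE_PRIORITY;
    attr.value.priority = priority;

    CUlaunchConfig config = {};
    config.gridDimX       = 256;
    config.gridDimY       = 1;
    config.gridDimZ       = 1;
    config.blockDimX      = 128;
    config.blockDimY      = 1;
    config.blockDimZ      = 1;
    config.sharedMemBytes = 0;
    config.hStream        = stream;
    config.attrs          = &attr;
    config.numAttrs       = 1;

    // The kernel takes no parameters in this sketch.
    return cuLaunchKernelEx(&config, func, nullptr, nullptr);
}
```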
2 votes
1 answer
208 views

I wish to construct a pipeline using multiple streams. Below is the code I have written: using namespace std; __global__ void vecAdd(float *c, const float *a, const float *b); void initBuffer(float *...
asked by Aitar (rep 23)
1 vote
0 answers
398 views

The CUDA documentation for scheduling the launch of a host function (cuLaunchHostFunc) says: "Completion of the function does not cause a stream to become active except as described above." I couldn't ...
asked by einpoklum (rep 137k)
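For context, a minimal sketch of how cuLaunchHostFunc is typically used; the callback body and the stream are placeholders, not taken from the question:

```
#include <cuda.h>
#include <cstdio>

void CUDA_CB my_host_fn(void* userData)
{
    // Runs on a CUDA-internal thread once all preceding work in the stream
    // has completed; it must not itself call CUDA API functions.
    printf("host function ran, userData = %p\n", userData);
}

void enqueue_host_work(CUstream hStream)
{
    static int payload = 42;
    cuLaunchHostFunc(hStream, my_host_fn, &payload);
    // Anything enqueued on hStream after this point waits for my_host_fn
    // to return before it is allowed to run.
}
```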
0 votes
1 answer
113 views

One of the attributes of CUDA memory pools is CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC, described in the doxygen as follows: Allow reuse of already completed frees when there is no dependency ...
asked by einpoklum (rep 137k)
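A small illustrative sketch of reading and toggling that attribute on a device's default memory pool via the driver API; it does not answer the semantic question itself:

```
#include <cuda.h>

void toggle_opportunistic_reuse(CUdevice dev, int enable)
{
    CUmemoryPool pool;
    cuDeviceGetDefaultMemPool(&pool, dev);

    // The reuse-policy attributes are plain int flags.
    int current = 0;
    cuMemPoolGetAttribute(pool, CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC,
                          &current);

    cuMemPoolSetAttribute(pool, CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC,
                          &enable);
}
```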
1 vote
1 answer
788 views

I need to pause the execution of all calls in a stream from a certain point in one part of the program until another part of the program decides to unpause this stream at an arbitrary time. This is ...
asked by surabax (rep 15)
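One possible gating approach, sketched under the assumption that blocking inside a host function is acceptable: a host function enqueued in the stream blocks until another part of the program releases it, which holds back everything enqueued after it.

```
#include <cuda_runtime.h>
#include <atomic>
#include <thread>
#include <chrono>

std::atomic<bool> gate_open{false};

void CUDART_CB gate_fn(void*)
{
    // Blocks the stream's progress; must not call CUDA APIs in here.
    while (!gate_open.load())
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

__global__ void work() {}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    work<<<1, 1, 0, stream>>>();             // runs immediately
    cudaLaunchHostFunc(stream, gate_fn, nullptr);
    work<<<1, 1, 0, stream>>>();             // held back by the gate

    // ... elsewhere in the program, at an arbitrary later time:
    gate_open.store(true);                   // unpause the stream

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```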
4 votes
2 answers
2k views

A CUDA stream is a queue of tasks: memory copies, event firing, event waits, kernel launches, callbacks... But - these queues don't have infinite capacity. In fact, empirically, I find that this limit ...
asked by einpoklum (rep 137k)
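One rough way to observe this limit empirically (an illustrative sketch, not a definitive measurement): enqueue many copies of a long-running kernel and watch when the host-side launch call itself starts to block.

```
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>

__global__ void spin(long long cycles)
{
    long long start = clock64();
    while (clock64() - start < cycles) { }
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (int i = 0; i < 4096; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        spin<<<1, 1, 0, stream>>>(1000000LL);
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        if (us > 1000.0)   // launch call blocked: the queue is presumably full
            printf("launch %d took %.0f us\n", i, us);
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```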
2 votes
2 answers
2k views

I am using the CUDA graph stream-capture API to implement a small demo with multiple streams. Following the CUDA Programming Guide here, I wrote the complete code. To my knowledge, kernelB should execute ...
asked by poohRui (rep 623)
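For reference, the fork/join capture pattern from the Programming Guide, sketched with placeholder kernels: stream2 is pulled into stream1's capture via event waits, so kernelB and kernelC become concurrent nodes in the captured graph.

```
#include <cuda_runtime.h>

__global__ void kernelA() {}
__global__ void kernelB() {}
__global__ void kernelC() {}
__global__ void kernelD() {}

int main()
{
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaEvent_t fork, join;
    cudaEventCreate(&fork);
    cudaEventCreate(&join);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream1, cudaStreamCaptureModeGlobal);

    kernelA<<<1, 1, 0, stream1>>>();
    cudaEventRecord(fork, stream1);
    cudaStreamWaitEvent(stream2, fork, 0);   // fork: stream2 joins the capture

    kernelB<<<1, 1, 0, stream1>>>();
    kernelC<<<1, 1, 0, stream2>>>();

    cudaEventRecord(join, stream2);
    cudaStreamWaitEvent(stream1, join, 0);   // join back into stream1
    kernelD<<<1, 1, 0, stream1>>>();

    cudaStreamEndCapture(stream1, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0);   // CUDA 12.x signature
    cudaGraphLaunch(exec, stream1);
    cudaStreamSynchronize(stream1);
    return 0;
}
```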
2 votes
1 answer
990 views

Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause a running CUDA kernel with an instruction I put in the host code, and resume it with the ...
asked by mehran (rep 213)
0 votes
1 answer
717 views

Suppose I have a struct: typedef enum {ON_CPU,ON_GPU,ON_BOTH} memLocation; typedef struct foo *foo; struct foo { cudaEvent_t event; float *deviceArray; float *hostArray; ...
asked by Jacob Faib (rep 1,150)
0 votes
1 answer
233 views

I have a single kernel which feeds data into two output parameters (dev_out_1 and dev_out_2) using a single stream. I want to copy the data back from the device to the host in parallel. My requirement is to ...
asked by Yona (rep 25)
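One common way to get the two device-to-host copies into separate streams after the producing kernel, sketched with placeholder sizes (the names dev_out_1/dev_out_2 mirror the question); whether the copies truly overlap still depends on the GPU's copy engines:

```
#include <cuda_runtime.h>

__global__ void produce(float* dev_out_1, float* dev_out_2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) { dev_out_1[i] = i; dev_out_2[i] = 2.0f * i; }
}

int main()
{
    const int n = 1 << 20;
    float *dev_out_1, *dev_out_2, *host_out_1, *host_out_2;
    cudaMalloc(&dev_out_1, n * sizeof(float));
    cudaMalloc(&dev_out_2, n * sizeof(float));
    cudaMallocHost(&host_out_1, n * sizeof(float));   // pinned, so the
    cudaMallocHost(&host_out_2, n * sizeof(float));   // copies stay async

    cudaStream_t compute, copy1, copy2;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy1);
    cudaStreamCreate(&copy2);
    cudaEvent_t done;
    cudaEventCreate(&done);

    produce<<<(n + 255) / 256, 256, 0, compute>>>(dev_out_1, dev_out_2, n);
    cudaEventRecord(done, compute);

    // Each copy stream waits for the kernel, then issues its own D2H copy.
    cudaStreamWaitEvent(copy1, done, 0);
    cudaMemcpyAsync(host_out_1, dev_out_1, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copy1);
    cudaStreamWaitEvent(copy2, done, 0);
    cudaMemcpyAsync(host_out_2, dev_out_2, n * sizeof(float),
                    cudaMemcpyDeviceToHost, copy2);

    cudaDeviceSynchronize();
    return 0;
}
```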
0 votes
1 answer
417 views

By default, a kernel will use all available SMs of the device (given enough blocks). However, I now have two streams, one compute-intensive and one memory-intensive, and I want to limit the maximal ...
asked by Subject_No_i
0 votes
1 answer
398 views

I want to overlap data transfers and kernel executions in a form like this: int numStreams = 3; int size = 10; for(int i = 0; i < size; i++) { cuMemcpyHtoDAsync( _bufferIn1, ...
asked by Eagle06 (rep 71)
3 votes
0 answers
1k views

I am trying to make some changes to the ResNet-18 model in PyTorch to invoke the execution of another auxiliary trained model which takes in the ResNet intermediate layer output at the end of each ...
asked by jallikattu
3 votes
1 answer
890 views

I am developing a multi-GPU-accelerated flow solver. Currently I am trying to implement communication hiding; that is, while data is being exchanged, the GPU computes the part of the mesh that is not ...
asked by Lenz (rep 81)
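A heavily reduced sketch of the usual communication-hiding split, with placeholder kernels and sizes: boundary work and the halo exchange go on one stream while the interior, which needs no remote data, is computed on another.

```
#include <cuda_runtime.h>

__global__ void compute_boundary(float* u, int n) { /* physics omitted */ }
__global__ void compute_interior(float* u, int n) { /* physics omitted */ }

void step(float* u_local, float* halo_send, float* halo_recv_remote,
          int n, int haloElems, int srcDevice, int peerDevice,
          cudaStream_t commStream, cudaStream_t compStream)
{
    // 1. Boundary part first, on the communication stream.
    compute_boundary<<<64, 128, 0, commStream>>>(u_local, n);

    // 2. Exchange halos with the neighbouring GPU while ...
    cudaMemcpyPeerAsync(halo_recv_remote, peerDevice, halo_send, srcDevice,
                        haloElems * sizeof(float), commStream);

    // 3. ... the interior (which needs no remote data) is computed
    //    concurrently on the compute stream.
    compute_interior<<<1024, 128, 0, compStream>>>(u_local, n);

    // 4. Join both streams before the next time step.
    cudaStreamSynchronize(commStream);
    cudaStreamSynchronize(compStream);
}
```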
4 votes
1 answer
2k views

I always thought that Hyper-Q technology is nothing but streams in the GPU. Later I found out I was wrong (am I?). So I did some reading about Hyper-Q and got even more confused. I was going through one ...
asked by sandeep.ganage
-1 votes
1 answer
1k views

My program is a pipeline which contains multiple kernels and memcpys. Each task will go through the same pipeline with different input data. The host code will first choose a Channel, an ...
asked by StrikeW (rep 533)
6 votes
1 answer
5k views

From a glance at the official NVIDIA Multi-Process Server docs, it is unclear to me how it interacts with CUDA streams. Here's an example: App 0: issues kernels to logical stream 0; App 1: issues ...
asked by Covi (rep 1,381)
0 votes
1 answer
393 views

The CUDA (driver API) documentation says: "The start of execution of a callback has the same effect as synchronizing an event recorded in the same stream immediately prior to the callback." It ...
asked by huseyin tugrul buyukisik
0 votes
1 answer
600 views

This program: #include <string> #include <stdexcept> struct buffers_t { void* host_buffer; void* device_buffer; }; void ensure_no_error(std::string message) { auto status = ...
asked by einpoklum (rep 137k)