Newest 'cuda-streams' Questions

0 votes

1 answer

99 views

CPU-GPU producer-consumer pattern using unified memory but GPU is in spin loop

I am trying to implement the producer-consumer problem between GPU and CPU, which I need for another project. The GPU requests some data from the CPU via unified memory. The CPU copies that data to a specific location in global ...

Chinmaya Bhat K K

1

asked Sep 30 at 18:38

-1 votes

2 answers

158 views

Are "cuWhateverAsync" calls with stream handle NULL universally equivalent to "cuWhatever" calls?

The CUDA Driver API has multiple calls with an "asynchronous" variant, e.g. cuMemcpy2D and cuMemcpy2DAsync, with the "asynchronous" variant taking a stream handle - and there are ...

einpoklum

137k

asked Oct 20, 2024 at 12:30

1 vote

1 answer

33 views

Some TensorRT conv layer forward blocked by cudaMemcpyAsync from another thread

See this nsys profile: I have observed that during the forward pass of some layers in TensorRT execution, a lock is acquired before launching the kernel. I attempted to determine the specific lock ...

user23864711

21

asked Aug 12, 2024 at 4:23

3 votes

0 answers

215 views

Why is using multiple CUDA streams not improving performance as expected?

I am working on optimizing a CUDA application that processes a matrix by updating each row sequentially. The process involves three main kernels: Kernel 1: Updates the pivot element of the current row....

Photos

61

asked Jun 22, 2024 at 19:57

2 votes

2 answers

818 views

How to Use CUDA Graphs with Interdependent Streams and Dynamic Parameters?

I have a CUDA program with multiple interdependent streams, and I want to convert it to use CUDA graphs to reduce launch overhead and improve performance. My program involves launching three kernels (...

Photos

61

asked Jun 20, 2024 at 7:06

3 votes

0 answers

190 views

Compute and Data transfer not happening concurrently in cuda Streams on Iteration 2

I have written a basic program where a chunk of data is loaded in CPU memory (Pinned), and then I transfer it in chunks to GPUs (Asynchronously), and then do computation on each chunk. So for each ...

Lokesh

31

asked Mar 8, 2024 at 18:38

0 votes

1 answer

154 views

What are the semantics of CUDA kernel launch priorities?

Starting with CUDA 12.0, one can specify a wider variety of "launch attributes", when launching a kernel, using a CUlaunchConfig structure; and one of the attributes we can place in a launch ...

einpoklum

137k

asked Jan 28, 2024 at 22:55

2 votes

1 answer

208 views

Why am I unable to establish a pipeline when using multiple concurrent streams in CUDA programming?

I wish to construct a pipeline using multiple streams. Below is the code I have written: using namespace std; __global__ void vecAdd(float *c, const float *a, const float *b); void initBuffer(float *...

Aitar

23

asked Jun 14, 2023 at 3:15

1 vote

0 answers

398 views

Does a CUDA stream "become active" after execution of a scheduled host function concludes?

The CUDA documentation for scheduling the launch of a host function (cuLaunchHostFunc) says: "Completion of the function does not cause a stream to become active except as described above." I couldn't ...

einpoklum

137k

asked May 17, 2023 at 18:18

0 votes

1 answer

113 views

What does CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC actually allow?

One of the attributes of CUDA memory pools is CU_MEMPOOL_ATTR_REUSE_ALLOW_OPPORTUNISTIC, described in the doxygen as follows: Allow reuse of already completed frees when there is no dependency ...

einpoklum

137k

asked Mar 19, 2023 at 22:36

1 vote

1 answer

788 views

Is there a way to block and unblock a CUDA stream arbitrarily?

I need to pause the execution of all calls in a stream from a certain point in one part of the program until another part of the program decides to unpause this stream at an arbitrary time. This is ...

surabax

15

asked Mar 13, 2023 at 20:14

4 votes

2 answers

2k views

What's the capacity of a CUDA stream (=queue)?

A CUDA stream is a queue of tasks: memory copies, event firing, event waits, kernel launches, callbacks... But - these queues don't have infinite capacity. In fact, empirically, I find that this limit ...

einpoklum

137k

asked Jun 24, 2022 at 19:29

2 votes

2 answers

2k views

Using multi streams in cuda graph, the execution order is uncontrolled

I am using the CUDA graph stream capture API to implement a small demo with multiple streams. Following the CUDA Programming Guide, I wrote the complete code. As far as I know, kernelB should execute ...

poohRui

623

asked May 17, 2022 at 3:27

2 votes

1 answer

990 views

How can I pause a CUDA stream and then resume it?

Suppose we have two CUDA streams running two CUDA kernels on a GPU at the same time. How can I pause a running CUDA kernel with an instruction I put in the host code and resume it with the ...

mehran

213

asked Jan 29, 2022 at 22:05

0 votes

1 answer

717 views

Reusing cudaEvent to serialize multiple streams

Suppose I have a struct: typedef enum {ON_CPU,ON_GPU,ON_BOTH} memLocation; typedef struct foo *foo; struct foo { cudaEvent_t event; float *deviceArray; float *hostArray; ...

Jacob Faib

1,150

asked Mar 5, 2021 at 20:27

0 votes

1 answer

233 views

CUDA cudaMemcpyAsync using single stream to host

I have a single kernel which is feeding data to two parameters (dev_out_1 and dev_out_2) using a single stream. I want to copy the data back from the device to the host in parallel. My requirement is to ...

Yona

25

asked Feb 7, 2021 at 13:59

0 votes

1 answer

417 views

Is it possible to manually set the SMs used for one CUDA stream?

By default, a kernel will use all available SMs of the device (if there are enough blocks). However, I now have 2 streams, one compute-intensive and one memory-intensive, and I want to limit the maximal ...

Subject_No_i

33

asked Jun 23, 2020 at 7:36

0 votes

1 answer

398 views

Overlapping transfers and kernel executions in CUDA with two loops

I want to overlap data transfers and kernel executions in a form like this: int numStreams = 3; int size = 10; for(int i = 0; i < size; i++) { cuMemcpyHtoDAsync( _bufferIn1, ...

Eagle06

71

asked Apr 16, 2020 at 21:14

3 votes

0 answers

1k views

Execute another model in parallel to a model's forward pass with PyTorch

I am trying to make some changes to the ResNet-18 model in PyTorch to invoke the execution of another auxiliary trained model which takes in the ResNet intermediate layer output at the end of each ...

jallikattu

31

asked Aug 28, 2019 at 22:05

3 votes

1 answer

890 views

Concurrency of one large kernel with many small kernels and memcopys (CUDA)

I am developing a multi-GPU accelerated flow solver. Currently I am trying to implement communication hiding. That means that while data is exchanged, the GPU computes the part of the mesh that is not ...

Lenz

81

asked Jul 16, 2019 at 16:15

4 votes

1 answer

2k views

What is the difference between Nvidia Hyper Q and Nvidia Streams?

I always thought that Hyper-Q technology is nothing but the streams in the GPU. Later I found I was wrong (am I?). So I did some reading about Hyper-Q and got more confused. I was going through one ...

sandeep.ganage

1,495

asked May 22, 2019 at 5:18

-1 votes

1 answer

1k views

Why operations in two CUDA Streams are not overlapping?

My program is a pipeline, which contains multiple kernels and memcpys. Each task will go through the same pipeline with different input data. The host code first chooses a Channel, an ...

StrikeW

533

asked Jan 15, 2019 at 14:47

6 votes

1 answer

5k views

What is the relationship between NVIDIA MPS (Multi-Process Server) and CUDA Streams?

Glancing at the official NVIDIA Multi-Process Server docs, it is unclear to me how it interacts with CUDA streams. Here's an example: App 0: issues kernels to logical stream 0; App 1: issues ...

Covi

1,381

asked Mar 7, 2018 at 23:35

0 votes

1 answer

393 views

Is cuStreamAddCallback as effective as cuStreamSynchronize in having latest bits of data on host?

The CUDA (driver API) documentation says: "The start of execution of a callback has the same effect as synchronizing an event recorded in the same stream immediately prior to the callback." It ...

huseyin tugrul buyukisik

12k

asked Feb 25, 2018 at 17:29

0 votes

1 answer

600 views

Enqueueing an async copy from a CUDA callback - not permitted?

This program: #include <string> #include <stdexcept> struct buffers_t { void* host_buffer; void* device_buffer; }; void ensure_no_error(std::string message) { auto status = ...

einpoklum

137k

asked Nov 1, 2017 at 9:14
