I’m using the PyTorch profiler to analyze sglang, and I noticed that in the CUDA timeline some launches are marked “Command Buffer Full”, which makes the corresponding cudaLaunchKernel calls take a very long time, as shown in the attached screenshot.
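For context, this is roughly how I capture the trace. Note this is a minimal sketch, not sglang’s actual entry point: the matmul loop below is a placeholder workload standing in for the real forward pass.

```python
import torch
from torch.profiler import profile, ProfilerActivity

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

with profile(activities=activities, with_stack=True) as prof:
    # Placeholder workload; in my case this is the sglang forward pass.
    x = torch.randn(512, 512)
    if torch.cuda.is_available():
        x = x.cuda()
    for _ in range(10):
        x = torch.relu(x @ x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()

# The resulting trace can be opened in chrome://tracing or Perfetto;
# the "Command Buffer Full" annotations appear on the cudaLaunchKernel rows.
prof.export_chrome_trace("trace.json")
```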
I would like to understand:
Why does “Command Buffer Full” occur? Is it because the GPU limits the maximum number of kernels that can be queued or running concurrently?
If that’s the case, does PyTorch report any information when cudaLaunchKernel encounters a full command buffer?
How can I inspect the GPU’s command buffer, e.g. its maximum capacity and how much of it is currently free?
