I am experiencing the following error while training a generative network via Pytorch 1.9.0+cu102:
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
While using a Google Colaboratory GPU session. This segment was triggered on either one of these two lines:
running_loss += loss.item()
or
target = target.to(device)
It produces the error on the first line when I am first running the notebook, and the second line each subsequent time I try to run the block. The first error occurs after training for 3 batches. The second error happens on the first batch. I can confirm that the device is cuda0, that device is available, and target is a pytorch tensor. Naturally, I tried to take the advice of the error and run:
!CUDA_LAUNCH_BLOCKING=1
and
os.system('CUDA_LAUNCH_BLOCKING=1')
However, neither of these lines changes the error message. According to a different post, this is because colab is running these lines in a subshell. The error does not occur when running on CPU, and I do not have access to a GPU device besides the GPU on Colab. While this question has been asked in many different forms, no answers are particularly helpful to me because they either recommend passing the aforementioned line, are about a situation fundamentally different from my own (such as training a classifier with an inappropriate number of classes), or recommend a solution which I have already tried, such as resetting the runtime or switching to CPU.
I am hoping to gain insight into the following questions:
- Is there a way for me to get a more specific error message? Efforts to set the launch blocking variable have been unsuccessful.
- How could it be that I am getting this error on two seemingly very different lines? How could it be that my network trains for 3 batches (it is always 3), but fails on the fourth?
- Does this situation remind anyone of an error that they have encountered previously, and have a possible route for ameliorating it given the limited information I can extract?