Extracting Meaningful Error Message from 'RuntimeError: CUDA error: device-side assert triggered' on Google Colab in Pytorch

Question

I am getting the following error while training a generative network with PyTorch 1.9.0+cu102:

RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

This happens in a Google Colaboratory GPU session. The error is raised on one of these two lines:

running_loss += loss.item()

or

target = target.to(device)

The error is raised on the first line the first time I run the notebook, and on the second line every subsequent time I run the block. The first error occurs after training for 3 batches; the second occurs on the first batch. I can confirm that the device is cuda:0, that the device is available, and that target is a PyTorch tensor. Naturally, I tried to follow the advice in the error message and ran:

!CUDA_LAUNCH_BLOCKING=1

and

os.system('CUDA_LAUNCH_BLOCKING=1')

However, neither of these lines changes the error message. According to a different post, this is because Colab runs these lines in a subshell. The error does not occur when running on CPU, and I do not have access to any GPU other than the one on Colab. While this question has been asked in many different forms, none of the answers are particularly helpful to me: they either recommend setting the aforementioned variable, address a situation fundamentally different from mine (such as training a classifier with an inappropriate number of classes), or recommend something I have already tried, such as resetting the runtime or switching to CPU.

I am hoping to gain insight into the following questions:

1. Is there a way for me to get a more specific error message? My attempts to set the launch-blocking variable have been unsuccessful.
2. How can I be getting this error on two seemingly very different lines? And how can my network train for 3 batches (it is always 3) but fail on the fourth?
3. Does this situation remind anyone of an error they have encountered before, and can they suggest a way to fix it given the limited information I can extract?

Brown Philip · Accepted Answer · 2021-07-06 22:20:58Z

12

I was able to get more information about the error by executing:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

BEFORE importing torch. This allowed me to get a more detailed traceback and ultimately diagnose the problem as an inappropriate loss function.
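
For anyone wondering why this works when the two attempts in the question did not, here is a minimal sketch using only the standard library. It assumes a Linux shell, as on Colab. The names `VAR`, `after_os_system` and `after_environ` are just for illustration.

```python
import os

VAR = "CUDA_LAUNCH_BLOCKING"

# Start clean so the comparison is meaningful.
os.environ.pop(VAR, None)

# Attempt from the question: the assignment happens in a throwaway shell
# process and is gone as soon as that shell exits.
os.system(f"{VAR}=1")
after_os_system = os.environ.get(VAR, "unset")

# Approach from this answer: change the environment of the Python process
# itself, which is the process that will later load torch and CUDA.
os.environ[VAR] = "1"
after_environ = os.environ.get(VAR, "unset")

# In the notebook, `import torch` would go here, after the variable is set.
print(after_os_system, after_environ)
```

If torch has already been imported in the session, restarting the runtime before running this cell is the safe option, since the variable needs to be in place before CUDA is initialized.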

Keerthi · Answer · 2021-10-05 04:58:45Z

5

This error usually has one of two causes:

1. Inconsistency in the number of classes
2. Wrong input for the loss function

If it's the first one, you should still get an error when you switch the runtime back to CPU, usually with a more descriptive message.
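
A quick way to check for the first cause without switching runtimes is to look at the labels before they reach the loss. This is only a sketch: a plain Python list stands in for the target tensor, and `out_of_range_labels`, `batch_labels` and the class count of 3 are made up for the example.

```python
def out_of_range_labels(labels, num_classes):
    """Return (position, label) pairs for labels outside [0, num_classes)."""
    return [(i, y) for i, y in enumerate(labels) if not 0 <= y < num_classes]


# Hypothetical batch: the model has 3 outputs, so valid labels are 0, 1 and 2.
batch_labels = [0, 2, 1, 3, 2]
bad = out_of_range_labels(batch_labels, num_classes=3)
print(bad)  # [(3, 3)]
```

With a real integer target tensor, `target.tolist()` gives a list that can be passed to the same helper.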

In my case, it was the second one. I had used BCE loss, whose input must lie between 0 and 1 (that is, probabilities, not raw logits). Any value outside that range can trigger this error. I fixed it by using:

criterion = nn.BCEWithLogitsLoss()

instead of:

criterion = nn.BCELoss()
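
To show why the input range matters, here is a rough pure-Python sketch of the two losses for a single value. It is not PyTorch's implementation, just the underlying formulas. The function names and the example logit of 2.5 are assumptions for illustration.

```python
import math


def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce(p, y):
    """Binary cross-entropy on a probability p; rejects inputs outside [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"BCE input must be in [0, 1], got {p}")
    eps = 1e-12  # keep log() away from zero
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def bce_with_logits(z, y):
    """Binary cross-entropy on a raw logit z (sigmoid folded in, stable form)."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


logit, label = 2.5, 1
loss_from_prob = bce(sigmoid(logit), label)
loss_from_logit = bce_with_logits(logit, label)

# Feeding the raw logit straight into plain BCE is the mistake described above.
try:
    bce(logit, label)
    rejected = False
except ValueError:
    rejected = True

print(abs(loss_from_prob - loss_from_logit) < 1e-9, rejected)
```

So if the network's last layer does not already apply a sigmoid, the logits-based loss is the one that matches its output. The alternative is to keep the plain BCE loss and apply a sigmoid to the output first.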

I also set:

import os
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

at the beginning of the code, before importing torch.

answered Oct 4, 2021 at 19:41 by Keerthi

edited Oct 5, 2021 at 4:58 by tomerpacific
