
I am running into the following error when trying to train this on this dataset.

Since this is the configuration published in the paper, I am assuming I am doing something incredibly wrong.

The error occurs on a different image every time I run training.

C:/w/1/s/windows/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1741, in <module>
    main()
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Noam/Code/vision_course/hopenet/deep-head-pose/code/original_code_augmented/train_hopenet_with_validation_holdout.py", line 187, in <module>
    loss_reg_yaw = reg_criterion(yaw_predicted, label_yaw_cont)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\modules\loss.py", line 431, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\functional.py", line 2204, in mse_loss
    ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered

Any ideas?

  • Can you please run it again on the CPU? The error message is often much clearer without a GPU. Commented Feb 2, 2020 at 6:34
  • This kind of error generally occurs when using NLL loss or cross entropy loss, when your dataset has negative labels (or labels greater than the number of classes). That is also the exact error you are getting: Assertion `t >= 0 && t < n_classes` failed. Commented Feb 2, 2020 at 18:18
  • @akshayk07 This happens on loss_reg_yaw = reg_criterion(yaw_predicted, label_yaw_cont), where reg_criterion = nn.MSELoss().cuda(gpu), so I really don't understand what could cause this. Commented Feb 2, 2020 at 22:56
  • @akshayk07 Yes, there is a cross entropy loss, but somehow the error only appears on other lines. I guess some asynchronous code causes it to die on unrelated lines. You can see the code in the link in the question; it is quite short. Commented Feb 6, 2020 at 14:09
  • For anyone else reading this: not for this particular paper, but in my case this actually happened because the training data had tags outside the bounds of the cross entropy n_classes, as @akshayk07 states correctly in his answer. Commented May 3, 2020 at 14:34
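Following the first comment's advice, another standard way to get a usable traceback is to make CUDA kernel launches synchronous via the CUDA_LAUNCH_BLOCKING environment variable, so the error surfaces on the line that actually triggered it; a minimal sketch:

```python
import os

# Setting CUDA_LAUNCH_BLOCKING=1 makes every CUDA kernel launch synchronous,
# so the Python traceback points at the real failing line instead of some
# later, unrelated one. It must be set before torch initializes CUDA
# (ideally before `import torch`), e.g. at the very top of the script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Alternatively, run the whole script on CPU (model.cpu(), tensors .cpu())
# to get a plain Python error message with the exact offending line.
```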

2 Answers


This kind of error generally occurs when using NLLLoss or CrossEntropyLoss and your dataset has negative labels (or labels greater than or equal to the number of classes). It is also the exact error you are getting: Assertion `t >= 0 && t < n_classes` failed.

This won't occur for MSELoss, but OP mentions that there is a CrossEntropyLoss elsewhere in the code, and that is where the error actually originates (CUDA kernels run asynchronously, so the program crashes on some other line). The solution is to clean the dataset and ensure that t >= 0 && t < n_classes is satisfied for every label t.
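One way to do that cleaning is a quick sanity check over the labels before training starts, so an out-of-range target fails loudly on the CPU instead of as an opaque CUDA assert. A minimal sketch (the function name and label values are hypothetical placeholders for your own data):

```python
def check_labels(labels, n_classes):
    """Raise if any target violates 0 <= t < n_classes."""
    bad = [t for t in labels if not (0 <= t < n_classes)]
    if bad:
        raise ValueError(
            f"labels out of range [0, {n_classes}): {sorted(set(bad))}"
        )

check_labels([0, 1, 2], n_classes=3)       # passes silently
try:
    check_labels([0, 3, -1], n_classes=3)  # these would trip the CUDA assert
except ValueError as e:
    print(e)
```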

Also, make sure your network output matches what the loss expects: NLLLoss takes log-probabilities (apply LogSoftmax to the output), and BCELoss takes probabilities in the range 0 to 1 (apply Sigmoid). This is not required for CrossEntropyLoss or BCEWithLogitsLoss, because they implement the activation inside the loss function. (Thanks to @PouyaB for pointing out.)
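To make that relationship concrete, here is a plain-Python sketch (no torch required) of how cross entropy is exactly log-softmax followed by negative log-likelihood, which is why CrossEntropyLoss takes raw logits while NLLLoss expects log-probabilities:

```python
import math

def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def nll(log_probs, target):
    # Per-sample negative log-likelihood, as in nn.NLLLoss.
    return -log_probs[target]

def cross_entropy(logits, target):
    # nn.CrossEntropyLoss fuses these two steps.
    return nll(log_softmax(logits), target)

loss = cross_entropy([2.0, 0.5, -1.0], target=0)  # ~0.2413
```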


5 Comments

  • Can you explain how it comes to crash on other lines?
  • It's difficult to say. All of the other issues/questions I have seen get the error on the NLL/cross entropy line. Maybe you could open an issue on the PyTorch GitHub and the devs can help you find an explanation.
  • It is possible that your code crashes on other lines because the CUDA kernels are not synchronized. Kernel launches are asynchronous: the Python side may already be executing a later line while the GPU is still working through earlier launches, so when a kernel fails, the debugger reports whatever line happens to be executing rather than the actual error source. Again, this is just a theory.
  • The same error is also raised when using BCELoss and the input is not in the expected range (0-1). In my case I had a missing nn.Sigmoid() on the output of the network; adding the nonlinearity fixed the issue.
  • Thanks so much. I'd been struggling with this error for hours. I was doing instance segmentation and did not count the background in the number of classes, so the labels' max value did not match.

I had a similar error. I was classifying two classes with labels 1 and 2. Changing the labels to 0 and 1 solved the issue.
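A minimal sketch of that remapping, for any set of raw label values (the data here is a placeholder): build an index from each distinct class to a contiguous 0-based id, which is the range CrossEntropyLoss / NLLLoss expect.

```python
raw_labels = [1, 2, 2, 1, 1]                      # placeholder labels {1, 2}
classes = sorted(set(raw_labels))                 # [1, 2]
to_index = {c: i for i, c in enumerate(classes)}  # {1: 0, 2: 1}
labels = [to_index[c] for c in raw_labels]        # [0, 1, 1, 0, 0]
```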

