
I am running into the following error when trying to train this on this dataset.

Since this is the configuration published in the paper, I am assuming I am doing something incredibly wrong.

The error occurs on a different image every time I run training.

C:/w/1/s/windows/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: block: [0,0,0], thread: [6,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1741, in <module>
    main()
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1735, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\pydevd.py", line 1135, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.1.1\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:/Noam/Code/vision_course/hopenet/deep-head-pose/code/original_code_augmented/train_hopenet_with_validation_holdout.py", line 187, in <module>
    loss_reg_yaw = reg_criterion(yaw_predicted, label_yaw_cont)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\modules\module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\modules\loss.py", line 431, in forward
    return F.mse_loss(input, target, reduction=self.reduction)
  File "C:\Noam\Code\vision_course\hopenet\venv\lib\site-packages\torch\nn\functional.py", line 2204, in mse_loss
    ret = torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: reduce failed to synchronize: cudaErrorAssert: device-side assert triggered

Any ideas?

  • Can you please run it again on the CPU? The error message is often much clearer without a GPU. Commented Feb 2, 2020 at 6:34
  • This kind of error generally occurs when using NLL loss or cross entropy loss, when your dataset has negative labels (or labels greater than the number of classes). That is also the exact error you are getting: Assertion `t >= 0 && t < n_classes` failed. Commented Feb 2, 2020 at 18:18
  • @akshayk07 This happens on loss_reg_yaw = reg_criterion(yaw_predicted, label_yaw_cont), where reg_criterion = nn.MSELoss().cuda(gpu), so I really don't understand what could cause this. Commented Feb 2, 2020 at 22:56
  • @akshayk07 Yes, there is a cross entropy loss, but somehow the error only appears on other lines. I guess some asynchronous code causes it to die on unrelated lines. You can see the code in the link in the question; it is quite short. Commented Feb 6, 2020 at 14:09
  • For anyone else reading this: not for this particular paper, but in my case this actually happened because the training data had tags outside the bounds of the cross entropy n_classes, as @akshayk07 states correctly in his answer. Commented May 3, 2020 at 14:34
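Following the first comment's advice, another standard way to get a usable traceback is to make CUDA kernel launches synchronous via the CUDA_LAUNCH_BLOCKING environment variable, so the error surfaces on the line that actually triggered it; a minimal sketch:

```python
import os

# Setting CUDA_LAUNCH_BLOCKING=1 makes every CUDA kernel launch synchronous,
# so the Python traceback points at the real failing line instead of some
# later, unrelated one. It must be set before torch initializes CUDA
# (ideally before `import torch`), e.g. at the very top of the script.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Alternatively, run the whole script on CPU (model.cpu(), tensors .cpu())
# to get a plain Python error message with the exact offending line.
```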

2 Answers


This kind of error generally occurs when using NLLLoss or CrossEntropyLoss and your dataset has negative labels (or labels greater than or equal to the number of classes). It is also the exact error you are getting: Assertion `t >= 0 && t < n_classes` failed.

This won't occur for MSELoss, but OP mentions that there is a CrossEntropyLoss elsewhere in the code, and that is where the error actually originates (CUDA kernels run asynchronously, so the program crashes on some other line). The solution is to clean the dataset and ensure that t >= 0 && t < n_classes is satisfied for every label t.
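One way to do that cleaning is a quick sanity check over the labels before training starts, so an out-of-range target fails loudly on the CPU instead of as an opaque CUDA assert. A minimal sketch (the function name and label values are hypothetical placeholders for your own data):

```python
def check_labels(labels, n_classes):
    """Raise if any target violates 0 <= t < n_classes."""
    bad = [t for t in labels if not (0 <= t < n_classes)]
    if bad:
        raise ValueError(
            f"labels out of range [0, {n_classes}): {sorted(set(bad))}"
        )

check_labels([0, 1, 2], n_classes=3)       # passes silently
try:
    check_labels([0, 3, -1], n_classes=3)  # these would trip the CUDA assert
except ValueError as e:
    print(e)
```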

Also, make sure your network output matches what the loss expects: NLLLoss takes log-probabilities (apply LogSoftmax to the output), and BCELoss takes probabilities in the range 0 to 1 (apply Sigmoid). This is not required for CrossEntropyLoss or BCEWithLogitsLoss, because they implement the activation inside the loss function. (Thanks to @PouyaB for pointing out.)
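To make that relationship concrete, here is a plain-Python sketch (no torch required) of how cross entropy is exactly log-softmax followed by negative log-likelihood, which is why CrossEntropyLoss takes raw logits while NLLLoss expects log-probabilities:

```python
import math

def log_softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def nll(log_probs, target):
    # Per-sample negative log-likelihood, as in nn.NLLLoss.
    return -log_probs[target]

def cross_entropy(logits, target):
    # nn.CrossEntropyLoss fuses these two steps.
    return nll(log_softmax(logits), target)

loss = cross_entropy([2.0, 0.5, -1.0], target=0)  # ~0.2413
```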


5 Comments

  • Can you explain how it comes to crash on other lines?
  • It's difficult to say. All of the other issues/questions I have seen get the error on the NLL/cross entropy line. Maybe you could open an issue on the PyTorch GitHub and the devs can help you find an explanation.
  • It is possible that your code crashes on other lines because the CUDA kernels are not synchronized. Kernel launches are asynchronous: the Python side may already be executing a later line while the GPU is still working through earlier launches, so when a kernel fails, the debugger reports whatever line happens to be executing rather than the actual error source. Again, this is just a theory.
  • The same error is also raised when using BCELoss and the input is not in the expected range (0-1). In my case I had a missing nn.Sigmoid() on the output of the network; adding the nonlinearity fixed the issue.
  • Thanks so much. I'd been struggling with this error for hours. I was doing instance segmentation and did not count the background in the number of classes, so the labels' max value did not match.

I had a similar error. I was classifying two classes with labels 1 and 2. Changing the labels to 0 and 1 solved the issue.
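A minimal sketch of that remapping, for any set of raw label values (the data here is a placeholder): build an index from each distinct class to a contiguous 0-based id, which is the range CrossEntropyLoss / NLLLoss expect.

```python
raw_labels = [1, 2, 2, 1, 1]                      # placeholder labels {1, 2}
classes = sorted(set(raw_labels))                 # [1, 2]
to_index = {c: i for i, c in enumerate(classes)}  # {1: 0, 2: 1}
labels = [to_index[c] for c in raw_labels]        # [0, 1, 1, 0, 0]
```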

