What is the correct way to perform gradient clipping in pytorch?
I have an exploding gradients problem.
A more complete example from here:
optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
loss.backward()
# Clip the total gradient norm to args.clip before the optimizer update
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
optimizer.step()
Here args.clip is the max_norm (clipping threshold) value taken from the args (perhaps via the argparse module).
clip_grad_norm (which is actually deprecated in favor of clip_grad_norm_, following the more consistent syntax of a trailing _ when in-place modification is performed) clips the norm of the overall gradient by concatenating all parameters passed to the function, as can be seen from the documentation:
The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.
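To see that "single vector" behaviour concretely, here is a small self-contained sketch (the toy two-layer model and the threshold of 1.0 are arbitrary choices for illustration). clip_grad_norm_ returns the total norm measured before clipping, and when that norm exceeds max_norm every gradient is scaled in-place by the same factor max_norm / total_norm:
import torch
import torch.nn as nn

# Toy model and loss, just to produce some gradients (illustration only).
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 1))
loss = model(torch.randn(8, 10)).pow(2).mean()
loss.backward()

# Returns the total (pre-clipping) norm of all gradients taken together,
# as if they were concatenated into one vector.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
print(f"total norm before clipping: {total_norm.item():.4f}")

# After the call, the combined gradient norm is at most max_norm.
clipped_norm = torch.cat([p.grad.flatten() for p in model.parameters()]).norm(2)
print(f"total norm after clipping:  {clipped_norm.item():.4f}")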
From your example it looks like you want clip_grad_value_ instead, which has a similar syntax and also modifies the gradients in-place:
clip_grad_value_(model.parameters(), clip_value)
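For context, here is a minimal sketch of where that call sits in a training step; the dataloader, criterion, optimizer, and the clip_value of 0.5 are placeholders assumed to already exist:
clip_value = 0.5  # placeholder threshold, tune for your model

for data, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(data), targets)
    loss.backward()
    # Clamp every gradient element to [-clip_value, clip_value] in-place.
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value)
    optimizer.step()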
Another option is to register a backward hook. It takes the current gradient as an input and may return a tensor, which will be used in place of the previous gradient, i.e. modifying it. This hook is called each time after a gradient has been computed, so there is no need to clip manually once the hook has been registered:
for p in model.parameters():
p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))
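Once the hooks are registered (e.g. right after building the model), the training loop itself contains no clipping call; a sketch, again assuming the usual model/criterion/optimizer/dataloader objects:
clip_value = 0.5  # placeholder threshold

# Register once, right after constructing the model.
for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))

for data, targets in dataloader:
    optimizer.zero_grad()
    loss = criterion(model(data), targets)
    loss.backward()   # the hooks clamp each gradient element-wise here
    optimizer.step()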
Reading through the forum discussion gave this:
clipping_value = 1  # arbitrary value of your choosing
torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)
I'm sure there is more depth to it than just this code snippet.
And if you are using Automatic Mixed Precision (AMP), you need to do a bit more before clipping, since AMP scales the gradients:
optimizer.zero_grad()
loss = model(data, targets)
scaler.scale(loss).backward()
# Unscales the gradients of optimizer's assigned params in-place
scaler.unscale_(optimizer)
# Since the gradients of optimizer's assigned params are unscaled, clips as usual:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
# optimizer's gradients are already unscaled, so scaler.step does not unscale them,
# although it still skips optimizer.step() if the gradients contain infs or NaNs.
scaler.step(optimizer)
# Updates the scale for next iteration.
scaler.update()
Reference: https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-clipping
For completeness, one may wonder how to determine the hyper-parameter max_norm of the function torch.nn.utils.clip_grad_norm_().
This hyper-parameter is also called the threshold, and it determines whether clipping is carried out (paper: On the difficulty of training Recurrent Neural Networks, Pascanu et al., 2012, page 6).
Its value can be determined as:
One good heuristic for setting this threshold is to look at statistics on the average norm over a sufficiently large number of updates. In our experience values from half to ten times this average can still yield convergence, though convergence speed can be affected.
An update here likely means processing one batch of data. In summary: at first, do not use clip_grad_norm_(); just train as usual and record the gradient norm at each update. After a sufficiently large number of updates, compute the average of those norms and multiply it by a factor in [0.5, 10]. That value is your max_norm.
Code Implementation:
import torch
from torch import linalg as LA
...
loss.backward()
# Flatten all gradients into a single vector and compute its L2 norm,
# which matches how clip_grad_norm_ measures the total norm.
grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
total_norm = LA.vector_norm(torch.cat(grads), ord=2)
# Save total_norm somewhere for later analysis.
optimizer.step()
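To turn those recorded norms into an actual max_norm, something like the following sketch works; the helper names record_grad_norm and suggest_max_norm and the factor of 5.0 are hypothetical choices from the [0.5, 10] range suggested above:
import statistics

grad_norms = []  # one total gradient norm per update, collected during the unclipped warm-up

def record_grad_norm(norm):
    # Call this with total_norm.item() after each backward pass of the warm-up phase.
    grad_norms.append(norm)

def suggest_max_norm(factor=5.0):
    # Average the observed norms and scale by a factor in [0.5, 10].
    return factor * statistics.mean(grad_norms)
After the warm-up phase, pass the returned value as max_norm to torch.nn.utils.clip_grad_norm_() for the rest of training.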