
What is the correct way to perform gradient clipping in pytorch?

I have an exploding gradients problem.

  • discuss.pytorch.org/t/proper-way-to-do-gradient-clipping/191 Commented Feb 15, 2019 at 20:23
  • @pierrom Thanks. I found that thread myself. I thought asking here would save everyone who comes after me and googles for a quick answer the hassle of reading through the whole discussion (which I haven't finished myself yet), Stack Overflow style. Going to forums to find answers reminds me of the 1990s. If no one else posts the answer before me, I will once I find it. Commented Feb 15, 2019 at 20:26

5 Answers


A more complete example from here:

optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
loss.backward()

# Clip the global gradient norm to args.clip, after backward() and before step()
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
optimizer.step()

6 Comments

Why is this more complete? I see the more votes, but don't really understand why this is better. Can you explain please?
This simply follows a popular pattern: insert torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip) between loss.backward() and optimizer.step().
what is args.clip?
does it matter if you call opt.zero_grad() before the forward pass or not? My guess is that the sooner it's zeroed out perhaps the sooner MEM freeing happens?
@FarhangAmaji the max_norm (clipping threshold) value from the args (perhaps from argparse module)
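
To answer the comment questions above in one place: args.clip is just a number, the max_norm threshold (here taken from argparse). A minimal self-contained sketch of the same pattern, with a made-up toy model, random data, and a hard-coded threshold of 1.0 standing in for args.clip:

```python
import torch

# Toy stand-ins for the real model/data; clip value 1.0 replaces args.clip
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(8, 4)
targets = torch.randn(8, 1)

for _ in range(3):
    optimizer.zero_grad()  # zeroing before the forward pass also works
    loss = torch.nn.functional.mse_loss(model(data), targets)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

As for zeroing early: it makes no difference to correctness as long as it happens before backward(); the memory for the gradient buffers is reused either way.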

clip_grad_norm (which is actually deprecated in favor of clip_grad_norm_, following the more consistent convention of a trailing _ when an in-place modification is performed) clips the norm of the overall gradient, treating all parameters passed to the function as one concatenated vector, as can be seen from the documentation:

The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.
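
To see what "concatenated into a single vector" means in practice, here is a small sketch (toy linear model, made-up numbers) checking that the value returned by clip_grad_norm_ equals the norm of all gradients flattened together:

```python
import torch

# Toy model purely for illustration
model = torch.nn.Linear(3, 2)
model(torch.ones(1, 3)).pow(2).sum().backward()

# Global L2 norm computed by hand: flatten and concatenate every gradient
manual = torch.cat([p.grad.flatten() for p in model.parameters()]).norm(2)

# clip_grad_norm_ returns the total (pre-clipping) norm of that same vector
returned = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

After the call, every gradient has been rescaled in-place so the global norm is at most 0.5.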

From your example it looks like you want clip_grad_value_ instead, which has a similar syntax and also modifies the gradients in-place:

clip_grad_value_(model.parameters(), clip_value)
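
A runnable sketch of that element-wise clipping (the toy parameter and numbers are made up for illustration):

```python
import torch

# clip_grad_value_ clamps every gradient element into [-clip_value, clip_value]
p = torch.nn.Parameter(torch.zeros(3))
(p * torch.tensor([10.0, -20.0, 3.0])).sum().backward()
# p.grad is now [10., -20., 3.]

torch.nn.utils.clip_grad_value_([p], clip_value=5.0)
# p.grad is now [5., -5., 3.]: only elements beyond the bound are clamped
```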

Another option is to register a backward hook. This takes the current gradient as input and may return a tensor, which will be used in place of the previous gradient, i.e. it modifies it. The hook is called each time a gradient has been computed, so there is no need for manual clipping once the hook has been registered:

for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))

7 Comments

It is worth mentioning here that these two approaches are NOT equivalent. The latter approach with registering a hook is definitely what most people want. The difference between these two approaches is that the latter approach clips gradients DURING backpropagation and the first approach clips gradients AFTER the entire backpropagation has taken place.
And why do we want to clip the gradients DURING backpropagation not AFTER it? Trying to understand why the latter is more desirable than the first.
@NikSp If you clip during backpropagation then the clipped gradients propagate to the upstream layers. Otherwise, the raw gradients propagate upstream and this might saturate the gradients for those upstream layers (if clipping would be performed after backpropagation). If the gradients of all layers saturate at the threshold (clip) value this might prevent convergence.
Could you expand on how to make sure the latter does l2 norm clipping. It currently looks like it is simply clipping the absolute value of individual elements. Also does register_hook work only on gradients? Because I would have expected something like param.grad. TIA.
While registering a hook is a fine option, it doesn't seem like the hook in the answer is applying a norm clipping. It's clipping the individual elements rather than the norm of the elements of the gradient.
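
Following up on the last two comments: the clamp hook above indeed clips element-wise, not by norm. A hedged sketch of a per-parameter L2-norm-clipping hook (global-norm clipping cannot be done inside a per-parameter hook, since each hook only ever sees one gradient tensor; the threshold here is made up):

```python
import torch

max_norm = 1.0  # made-up threshold for illustration

def norm_clip_hook(grad):
    # Rescale the whole gradient tensor so its L2 norm is at most max_norm.
    # Unlike torch.clamp, this preserves the gradient's direction.
    scale = (max_norm / (grad.norm(2) + 1e-6)).clamp(max=1.0)
    return grad * scale

p = torch.nn.Parameter(torch.ones(4))
p.register_hook(norm_clip_hook)
(p * 10.0).sum().backward()
# p.grad now has L2 norm at most max_norm, with its direction unchanged
```

And yes, register_hook on a tensor only sees the gradient; the hook receives grad directly rather than going through param.grad.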

Reading through the forum discussion gave this:

clipping_value = 1  # arbitrary value of your choosing
torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)

I'm sure there is more depth to it than only this code snippet.

Comments


And if you are using Automatic Mixed Precision (AMP), you need to do a bit more before clipping as AMP scales the gradient:

optimizer.zero_grad()
loss = model(data, targets)
scaler.scale(loss).backward()

# Unscales the gradients of optimizer's assigned params in-place
scaler.unscale_(optimizer)

# Since the gradients of optimizer's assigned params are unscaled, clips as usual:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# optimizer's gradients are already unscaled, so scaler.step does not unscale them,
# although it still skips optimizer.step() if the gradients contain infs or NaNs.
scaler.step(optimizer)

# Updates the scale for next iteration.
scaler.update()

Reference: https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-clipping

Comments


For completeness, one may wonder how to determine the hyper-parameter max_norm in function torch.nn.utils.clip_grad_norm_().

This hyper-parameter is also called threshold, which decides if the clipping should be carried out. (Paper: On the difficulty of training Recurrent Neural Networks, Pascanu et al., 2012, Page 6)

[Image: the paper's clipping rule: if ‖ĝ‖ ≥ threshold, then ĝ ← (threshold / ‖ĝ‖) ĝ]

Its value can be determined as:

One good heuristic for setting this threshold is to look at statistics on the average norm over a sufficiently large number of updates. In our experience values from half to ten times this average can still yield convergence, though convergence speed can be affected.

An update here likely means the processing of one data batch. In summary: at first, do not use clip_grad_norm_(), just run as usual and collect the gradient norm for each update. After a sufficiently large number of updates, compute the average of those norms and multiply it by a factor between 0.5 and 10. That gives max_norm.

Code Implementation:

import torch
from torch import linalg as LA
...
loss.backward()
# Global L2 norm over all gradients, flattened and concatenated into one vector
# (vector_norm expects a tensor, not a list of tensors)
grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
v = LA.vector_norm(torch.cat(grads), ord=2)
# Save v to somewhere for later analysis.
optimizer.step()

1 Comment

Probably better to use the median if you expect outlier (very high) gradients due to noisy labels or edge cases.
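
A small sketch of that heuristic with made-up norm values, comparing a mean-based and a median-based threshold (pure stdlib, no PyTorch needed):

```python
import statistics

# Hypothetical gradient norms collected over a warm-up run, one per update;
# 45.0 is an outlier from a noisy batch.
norms = [1.2, 0.9, 1.1, 45.0, 1.0, 0.95, 1.05]

factor = 1.0  # the paper suggests anywhere from 0.5x to 10x the average
mean_threshold = statistics.mean(norms) * factor      # pulled up by the outlier
median_threshold = statistics.median(norms) * factor  # robust to the outlier

max_norm = median_threshold  # value to pass to clip_grad_norm_
```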
