
What is the correct way to perform gradient clipping in pytorch?

I have an exploding gradients problem.

  • discuss.pytorch.org/t/proper-way-to-do-gradient-clipping/191 Commented Feb 15, 2019 at 20:23
  • @pierrom Thanks. I found that thread myself. I thought asking here would save everyone who comes after me and googles for a quick answer the hassle of reading through the whole discussion (which I haven't finished myself yet), Stack Overflow style. Going to forums to find answers reminds me of the 1990s. If no one else posts the answer before me, I will once I find it. Commented Feb 15, 2019 at 20:26

5 Answers


A more complete example from here:

optimizer.zero_grad()
loss, hidden = model(data, hidden, targets)
loss.backward()

# Clip the global gradient norm to args.clip, after backward() and before step()
torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip)
optimizer.step()

6 Comments

Why is this more complete? I see the more votes, but don't really understand why this is better. Can you explain please?
This simply follows a popular pattern: insert torch.nn.utils.clip_grad_norm_(model.parameters(), args.clip) between loss.backward() and optimizer.step().
what is args.clip?
does it matter if you call opt.zero_grad() before the forward pass or not? My guess is that the sooner it's zeroed out perhaps the sooner MEM freeing happens?
@FarhangAmaji the max_norm (clipping threshold) value from the args (perhaps from argparse module)
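
To answer the comment questions above in one place: args.clip is just a number, the max_norm threshold (here taken from argparse). A minimal self-contained sketch of the same pattern, with a made-up toy model, random data, and a hard-coded threshold of 1.0 standing in for args.clip:

```python
import torch

# Toy stand-ins for the real model/data; clip value 1.0 replaces args.clip
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(8, 4)
targets = torch.randn(8, 1)

for _ in range(3):
    optimizer.zero_grad()  # zeroing before the forward pass also works
    loss = torch.nn.functional.mse_loss(model(data), targets)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most 1.0
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

As for zeroing early: it makes no difference to correctness as long as it happens before backward(); the memory for the gradient buffers is reused either way.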

clip_grad_norm (which is actually deprecated in favor of clip_grad_norm_, following the more consistent convention of a trailing _ when an in-place modification is performed) clips the norm of the overall gradient, treating all parameters passed to the function as one concatenated vector, as can be seen from the documentation:

The norm is computed over all gradients together, as if they were concatenated into a single vector. Gradients are modified in-place.
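
To see what "concatenated into a single vector" means in practice, here is a small sketch (toy linear model, made-up numbers) checking that the value returned by clip_grad_norm_ equals the norm of all gradients flattened together:

```python
import torch

# Toy model purely for illustration
model = torch.nn.Linear(3, 2)
model(torch.ones(1, 3)).pow(2).sum().backward()

# Global L2 norm computed by hand: flatten and concatenate every gradient
manual = torch.cat([p.grad.flatten() for p in model.parameters()]).norm(2)

# clip_grad_norm_ returns the total (pre-clipping) norm of that same vector
returned = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
```

After the call, every gradient has been rescaled in-place so the global norm is at most 0.5.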

From your example it looks like you want clip_grad_value_ instead, which has a similar syntax and also modifies the gradients in-place:

clip_grad_value_(model.parameters(), clip_value)
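
A runnable sketch of that element-wise clipping (the toy parameter and numbers are made up for illustration):

```python
import torch

# clip_grad_value_ clamps every gradient element into [-clip_value, clip_value]
p = torch.nn.Parameter(torch.zeros(3))
(p * torch.tensor([10.0, -20.0, 3.0])).sum().backward()
# p.grad is now [10., -20., 3.]

torch.nn.utils.clip_grad_value_([p], clip_value=5.0)
# p.grad is now [5., -5., 3.]: only elements beyond the bound are clamped
```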

Another option is to register a backward hook. This takes the current gradient as input and may return a tensor, which will be used in place of the previous gradient, i.e. it modifies it. The hook is called each time a gradient has been computed, so there is no need for manual clipping once the hook has been registered:

for p in model.parameters():
    p.register_hook(lambda grad: torch.clamp(grad, -clip_value, clip_value))

7 Comments

It is worth mentioning here that these two approaches are NOT equivalent. The latter approach with registering a hook is definitely what most people want. The difference between these two approaches is that the latter approach clips gradients DURING backpropagation and the first approach clips gradients AFTER the entire backpropagation has taken place.
And why do we want to clip the gradients DURING backpropagation not AFTER it? Trying to understand why the latter is more desirable than the first.
@NikSp If you clip during backpropagation then the clipped gradients propagate to the upstream layers. Otherwise, the raw gradients propagate upstream and this might saturate the gradients for those upstream layers (if clipping would be performed after backpropagation). If the gradients of all layers saturate at the threshold (clip) value this might prevent convergence.
Could you expand on how to make sure the latter does l2 norm clipping. It currently looks like it is simply clipping the absolute value of individual elements. Also does register_hook work only on gradients? Because I would have expected something like param.grad. TIA.
While registering a hook is a fine option, it doesn't seem like the hook in the answer is applying a norm clipping. It's clipping the individual elements rather than the norm of the elements of the gradient.
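
Following up on the last two comments: the clamp hook above indeed clips element-wise, not by norm. A hedged sketch of a per-parameter L2-norm-clipping hook (global-norm clipping cannot be done inside a per-parameter hook, since each hook only ever sees one gradient tensor; the threshold here is made up):

```python
import torch

max_norm = 1.0  # made-up threshold for illustration

def norm_clip_hook(grad):
    # Rescale the whole gradient tensor so its L2 norm is at most max_norm.
    # Unlike torch.clamp, this preserves the gradient's direction.
    scale = (max_norm / (grad.norm(2) + 1e-6)).clamp(max=1.0)
    return grad * scale

p = torch.nn.Parameter(torch.ones(4))
p.register_hook(norm_clip_hook)
(p * 10.0).sum().backward()
# p.grad now has L2 norm at most max_norm, with its direction unchanged
```

And yes, register_hook on a tensor only sees the gradient; the hook receives grad directly rather than going through param.grad.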

Reading through the forum discussion gave this:

clipping_value = 1  # arbitrary value of your choosing
torch.nn.utils.clip_grad_norm_(model.parameters(), clipping_value)

I'm sure there is more depth to it than only this code snippet.

Comments


And if you are using Automatic Mixed Precision (AMP), you need to do a bit more before clipping as AMP scales the gradient:

optimizer.zero_grad()
loss = model(data, targets)
scaler.scale(loss).backward()

# Unscales the gradients of optimizer's assigned params in-place
scaler.unscale_(optimizer)

# Since the gradients of optimizer's assigned params are unscaled, clips as usual:
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)

# optimizer's gradients are already unscaled, so scaler.step does not unscale them,
# although it still skips optimizer.step() if the gradients contain infs or NaNs.
scaler.step(optimizer)

# Updates the scale for next iteration.
scaler.update()

Reference: https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-clipping

Comments


For completeness, one may wonder how to determine the hyper-parameter max_norm in function torch.nn.utils.clip_grad_norm_().

This hyper-parameter is also called threshold, which decides if the clipping should be carried out. (Paper: On the difficulty of training Recurrent Neural Networks, Pascanu et al., 2012, Page 6)

[Image: the paper's clipping rule: if ‖ĝ‖ ≥ threshold, then ĝ ← (threshold / ‖ĝ‖) ĝ]

Its value can be determined as:

One good heuristic for setting this threshold is to look at statistics on the average norm over a sufficiently large number of updates. In our experience values from half to ten times this average can still yield convergence, though convergence speed can be affected.

An update here likely means the processing of one data batch. In summary: at first, do not use clip_grad_norm_(), just run as usual and collect the gradient norm for each update. After a sufficiently large number of updates, compute the average of those norms and multiply it by a factor between 0.5 and 10. That gives max_norm.

Code Implementation:

import torch
from torch import linalg as LA
...
loss.backward()
# Global L2 norm over all gradients, flattened and concatenated into one vector
# (vector_norm expects a tensor, not a list of tensors)
grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
v = LA.vector_norm(torch.cat(grads), ord=2)
# Save v to somewhere for later analysis.
optimizer.step()

1 Comment

Probably better to use the median if you expect outlier (very high) gradients due to noisy labels or edge cases.
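
A small sketch of that heuristic with made-up norm values, comparing a mean-based and a median-based threshold (pure stdlib, no PyTorch needed):

```python
import statistics

# Hypothetical gradient norms collected over a warm-up run, one per update;
# 45.0 is an outlier from a noisy batch.
norms = [1.2, 0.9, 1.1, 45.0, 1.0, 0.95, 1.05]

factor = 1.0  # the paper suggests anywhere from 0.5x to 10x the average
mean_threshold = statistics.mean(norms) * factor      # pulled up by the outlier
median_threshold = statistics.median(norms) * factor  # robust to the outlier

max_norm = median_threshold  # value to pass to clip_grad_norm_
```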
