
I have some PyTorch code which demonstrates the gradient calculation within PyTorch, but I am thoroughly confused about what gets calculated and how it is used. This post demonstrates the usage of it, but it does not make sense to me in terms of the backpropagation algorithm. Looking at the gradients of in1 and in2 in the example below, I realized that the gradient of in1 and in2 is the derivative of the loss function, but my understanding is that the update also needs to account for the actual loss value. Where is the loss value getting used? Am I missing something here?

import torch

in1 = torch.randn(2, 2, requires_grad=True)
in2 = torch.randn(2, 2, requires_grad=True)
target = torch.randn(2, 2)
l1 = torch.nn.L1Loss()
l2 = torch.nn.MSELoss()
out1 = l1(in1, target)
out2 = l2(in2, target)
out1.backward()
out2.backward()
print(in1.grad)
print(in2.grad)

1 Answer


Backpropagation is based on the chain rule for calculating derivatives. This means the gradients are computed step by step from tail to head and always passed back to the previous step ("previous" w.r.t. the preceding forward pass).

For a scalar output the process is initiated by assuming a gradient of d(out1) / d(out1) = 1. If you're calling backward on a (non-scalar) tensor, though, you need to provide the initial gradient yourself, since the choice is ambiguous.
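To illustrate the non-scalar case, here is a minimal sketch (the tensor names are made up for this example): calling backward on a 2x2 tensor requires passing an initial gradient explicitly, e.g. a tensor of ones.

```python
import torch

x = torch.randn(2, 2, requires_grad=True)
y = x * 3  # y is a 2x2 tensor, not a scalar

# y.backward() alone would raise a RuntimeError because y is non-scalar.
# We supply the initial gradient ourselves: all ones, i.e. the gradient
# of an implicit sum over the entries of y.
y.backward(torch.ones_like(y))
print(x.grad)  # every entry is 3.
```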

Let's look at an example that involves more steps to compute the output:

a = torch.tensor(1., requires_grad=True)
b = a**2
c = 5*b
c.backward()
print(a.grad)  # Prints: tensor(10.)

So what happens here?

  1. The process is initiated by using d(c)/d(c) = 1.
  2. Then the previous gradient is computed as d(c)/d(b) = 5 and multiplied with the downstream gradient (1 in this case), i.e. 5 * 1 = 5.
  3. Again the previous gradient is computed as d(b)/d(a) = 2*a = 2 and multiplied again with the downstream gradient (5 in this case), i.e. 2 * 5 = 10.
  4. Hence we arrive at a gradient value of 10 for the initial tensor a.
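The four steps above can be replicated by hand and compared against what autograd stores in a.grad:

```python
import torch

a = torch.tensor(1., requires_grad=True)
b = a**2
c = 5 * b
c.backward()

# Replicate the chain rule manually:
grad_c = 1.0               # step 1: d(c)/d(c) = 1
grad_b = 5.0 * grad_c      # step 2: d(c)/d(b) = 5, times incoming gradient
grad_a = 2 * 1.0 * grad_b  # step 3: d(b)/d(a) = 2*a = 2, times incoming gradient

print(grad_a)         # 10.0
print(a.grad.item())  # 10.0 -- autograd agrees
```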

Now in effect this calculates d(c)/d(a), and that's all there is to it: it is the gradient of c with respect to a, and no notion of a "target loss" is used. Even if the loss were zero, that doesn't mean the gradient has to be; it is up to the optimizer to step in the correct (downhill) direction and to stop once the loss is sufficiently small.
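You can see this directly with the L1 loss from the question: its gradient is sign(input - target) / numel (with the default mean reduction), so it is identical for a tiny loss and a huge one. The tensor values below are chosen just for illustration.

```python
import torch

target = torch.zeros(2, 2)
small = torch.full((2, 2), 0.01, requires_grad=True)   # loss = 0.01
large = torch.full((2, 2), 100.0, requires_grad=True)  # loss = 100.0
loss_fn = torch.nn.L1Loss()

loss_fn(small, target).backward()
loss_fn(large, target).backward()

# Both gradients are sign(input - target) / 4 = 0.25 everywhere,
# even though the loss values differ by four orders of magnitude.
print(small.grad)
print(large.grad)
```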
