
I have been trying to learn about neural networks, and I don't understand the backpropagation algorithm. I will state what I have understood so far in a formal manner and point out where I am stuck. Please correct me if I'm wrong!

Let $S$ be a positive integer. Let $I$ be a subset of $[0,1]^S$ (the inputs) and let $O$ be a finite totally ordered set (the outputs). Let $f : I \rightarrow O$ be a map (the one that we want the neural network to "learn").

Define $resp : [0,1]^O \rightarrow O$ by $resp(x) := \min\{o \in O \ \vert \ x_o = \max\{x_{o'} \ \vert \ o' \in O\}\}$ for all $x = (x_o)_{o \in O}$ (this chooses the most activated output, breaking ties by the order on $O$). If $o \in O$, let $\delta_{o}$ be the map sending $o$ to $1$ and every other $o'$ to $0$.
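Concretely, if an element of $[0,1]^O$ is represented as a Python dict keyed by the elements of $O$ (my own illustration, not part of the formal setup), $resp$ could be written as:

```python
def resp(x):
    """Smallest output whose activation is maximal (ties broken by the order on O)."""
    m = max(x.values())
    return min(o for o, v in x.items() if v == m)
```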

Let $\sigma$ be a sigmoid-like function.

The goal is to find $M \in L(\mathbb{R}^S,\mathbb{R}^O)$ such that for "most" (if possible, all) $i \in I$, $f(i) = (resp \circ (\sigma \times \cdots \times \sigma) \circ M)(i)$. Let $I' \subseteq I$ be a finite subset; its elements are called example inputs.

To that purpose, define $c : [0,1]^O \times [0,1]^O \rightarrow \mathbb{R}$ by $c(a,b) = \sum_{o \in O} (a_o - b_o)^2$ for all $(a,b) \in [0,1]^O \times [0,1]^O$ (in general, one can take other $c$, but I restrict my attention to this function).

The goal is to find some $M$ minimizing $\phi_i : M \mapsto c\big(((\sigma \times \cdots \times \sigma) \circ M)(i),\ \delta_{f(i)}\big)$ simultaneously for all $i$.

If $\psi : \mathbb{R}^n \rightarrow \mathbb{R}$ is a differentiable function, $\mu$ is a real number, and $x \in \mathbb{R}^n$, we call the action of replacing $x$ by $x - \mu \nabla \psi(x)$ a step of gradient descent of rate $\mu$.

I will now describe a few algorithms. Some of them may fail to achieve the goal stated above. Can you tell me which of these work, and what their names are? I should say that, for the moment, I am not interested in questions of local vs. global minima.

1) "Optimizing after each example"

  • Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
  • Take $i_1 \in I'$. Replace $M_0$ by $M_1$, obtained by performing many steps of gradient descent of rate $\mu$ on the function $M \mapsto c\big(((\sigma \times \cdots \times \sigma) \circ M)(i_1),\ \delta_{f(i_1)}\big)$.
  • Take $i_2 \in I'$ and repeat the last step with $i_2$ instead of $i_1$.
  • Take $i_3 \in I'$ and repeat the last step with $i_3$ instead of $i_2$, and so on.
  • Finally, take $i_q$, the last element of $I'$, and repeat the last step with $i_q$ instead of $i_{q-1}$.

I expect this algorithm not to do what we want: indeed, optimizing at step $2$ (for $i_2$) might "de-optimize" what was done at step $1$ (for $i_1$).

2) "Taking one step after each example"

  • Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
  • Take $i_1 \in I'$. Replace $M_0$ by $M_1$, obtained by performing one step of gradient descent of rate $\mu$ on the function $M \mapsto c\big(((\sigma \times \cdots \times \sigma) \circ M)(i_1),\ \delta_{f(i_1)}\big)$.
  • Take $i_2 \in I'$ and repeat the last step with $i_2$ instead of $i_1$.
  • Take $i_3 \in I'$ and repeat the last step with $i_3$ instead of $i_2$, and so on.
  • Finally, take $i_q$, the last element of $I'$, and repeat the last step with $i_q$ instead of $i_{q-1}$.

I do not know whether this algorithm does what we want.

3) "Studying a whole bunch of examples"

  • Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
  • Perform many steps of gradient descent of rate $\mu$ on the function $E : M \mapsto \sum_{i \in I'} c\big(((\sigma \times \cdots \times \sigma) \circ M)(i),\ \delta_{f(i)}\big)$.

I expect this algorithm to do what we want, but if it does, I don't yet know how to perform it efficiently.
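To fix ideas, here is a minimal numpy sketch of what I mean in algorithms 2) and 3), taking $\sigma$ to be the logistic function and writing $t = \delta_{f(i)}$ for the target vector; the helper `grad_phi` is my own naming, and the gradient is computed by hand for this one-layer case.

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))      # logistic sigmoid
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))  # its derivative

def grad_phi(M, x, t):
    """Gradient with respect to M of phi(M) = sum_o (sigma((Mx)_o) - t_o)^2."""
    z = M @ x
    r = 2.0 * (sigma(z) - t) * dsigma(z)        # derivative of the cost in z
    return np.outer(r, x)                       # same shape as M, i.e. (|O|, S)

def algorithm_2(M, examples, mu):
    """One pass over I': a single gradient step per example."""
    for x, t in examples:
        M = M - mu * grad_phi(M, x, t)
    return M

def algorithm_3(M, examples, mu, steps):
    """Many gradient steps on the summed cost E."""
    for _ in range(steps):
        M = M - mu * sum(grad_phi(M, x, t) for x, t in examples)
    return M
```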

I am also looking for references on neural networks written for mathematicians.

A large part of the question has been rewritten.

  • In general, there is no guarantee that the loss decreases with every step of gradient descent (in fact, it is quite common that it doesn't). For instance, if you are very close to the global minimum but your step size is too large, you will overshoot the global minimum and start climbing up the graph of the loss function on the other side (imagine a parabola). Commented Dec 28, 2017 at 13:48
  • I also don't know a good reference for mathematicians per se, but the book An Introduction to Statistical Learning is a good overview of basic machine learning/statistical learning techniques that gives a good abstract perspective on the problem in its first chapter, and there is a set of notes by Goodfellow et al. that goes into a good amount of detail. Commented Dec 28, 2017 at 14:40
  • One confusion you seem to have is how the loss function is minimized over the examples. You don't minimize each $\phi_i$ individually; you fix the sample set, devise a cost function like the one I describe below, which sums the loss over the examples, and minimize that cost. This way you find a function that does as well as possible on all the examples in your training set at once. Commented Dec 28, 2017 at 14:56
  • Let me also point out that your main question seems to be more about gradient descent than backprop. I've tailored the answer below accordingly. Commented Dec 28, 2017 at 15:02
  • Answer to your first comment: yes, I am aware of this fact; this is why I said "for $\mu$ small enough". Commented Dec 28, 2017 at 15:19

1 Answer


Allow me to use slightly different notation (I think you are having some confusion) and abstract the problem slightly. If you want a more general mathematical picture of what gradient descent and backpropagation are, here goes (I will answer your bold question by the end).

Let's suppose our inputs are vectors $\mathbf{x} \in \mathbb{R}^n$ and our outputs are vectors $\mathbf{y} \in \mathbb{R}^N$. (Note that you may want to restrict the codomain of your function to $[0,1]^N$ if you are doing logistic regression; that is a special case of what I describe here.)

Abstractly, in machine learning, we choose a space of functions $\mathbb{R}^n \to \mathbb{R}^N$ mapping inputs to outputs, and then find the best function in that space to meet our goal (minimizing a cost function). The space of functions we are considering is always determined by an assumption about the structure of the functions (e.g. we can choose a neural network with so many layers and so many nodes per layer etc.).

Let's abstract this and say that our function space is parameterized by a vector $\mathbf{w} \in \mathbb{R}^{W}$. i.e. there is a function $$f(-,-) \colon \mathbb{R}^n\times \mathbb{R}^W \to \mathbb{R}^N.$$ Fixing $\mathbf{w}$ is the same as picking out a function $$f(-,\mathbf{w}) \colon \mathbb{R}^n \to \mathbb{R}^N$$ in our function space. If we are considering a certain neural network architecture, the $w$'s will be the weights in all the hidden layers.

Now how do we choose the best $\mathbf{w}$? First, you have to choose a loss function and a sample set, which is a list of $m$ examples of inputs and desired outputs $(\mathbf{x}_i,\mathbf{y}_i)$. Abstractly, the loss function is a function $$ L \colon \mathbb{R}^N \times \mathbb{R}^N \to \mathbb{R}$$ and we want to solve the minimization problem $$\mathbf{w}^0 = \text{argmin}_{\mathbf{w}\in \mathbb{R}^W} E(\mathbf{w}) = \text{argmin}_{\mathbf{w}\in \mathbb{R}^W} \sum_{i=1}^m L(f(\mathbf{x}_i,\mathbf{w}),\mathbf{y}_i).$$ i.e. we want to find $\mathbf{w}^0$ so that the function $f(-,\mathbf{w}^0)$ minimizes loss on the example set (here $E$ is just notation for the sum of the loss over the examples).

Now, there is a general algorithm for trying to find $\mathbf{w}^{0}$, called gradient descent. It goes like this (sketched in code after the list):

  • Fix a starting point $\mathbf{w}^1 \in \mathbb{R}^W$.
  • At each step, update with the rule $$ \mathbf{w}^{j+1} = \mathbf{w}^j - \varepsilon \nabla E(\mathbf{w}^j)$$ (where $\varepsilon$ is some constant you choose beforehand).
  • Stop after a stopping condition is reached (e.g. after some specified number of steps, say $j = 1000$).
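As a minimal Python sketch (assuming a function `grad_E` that returns $\nabla E(\mathbf{w})$, computed e.g. by backpropagation as described below):

```python
import numpy as np

def gradient_descent(grad_E, w_start, eps=0.01, max_steps=1000):
    """Plain gradient descent: w^{j+1} = w^j - eps * grad E(w^j)."""
    w = np.asarray(w_start, dtype=float)
    for _ in range(max_steps):      # stopping condition: fixed number of steps
        w = w - eps * grad_E(w)
    return w
```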

So what is Backpropagation?

Well, in the algorithm above we wanted to compute $$\nabla E(\mathbf{w}) = \left( \frac{\partial E}{\partial w_1}(\mathbf{w}), \ldots , \frac{\partial E}{\partial w_W}(\mathbf{w})\right)$$ so that we can update each weight $w_i^j$ according to the rule $$w_i^{j+1} = w^j_i - \varepsilon \frac{\partial E}{\partial w_i}(\mathbf{w}^j).$$

How do we compute these derivatives? It depends on the function $f$, but in general, when $f$ has the structure of a neural network, you have to use the chain rule. More generally, if you know the "computational graph" of $f$ you can compute these derivatives using the chain rule as follows.

Example: Suppose we are considering a function space of functions of the form

$$f(\mathbf{x},\mathbf{w}) = g_2(g_1(\mathbf{x},\mathbf{u}), \mathbf{v}), \qquad \mathbf{w} = (\mathbf{u},\mathbf{v})$$ (this could be a neural network with one hidden layer, where each $g_i$ is linear followed by an activation). Here $\mathbb{R}^W = \mathbb{R}^U \times \mathbb{R}^V$. Thus

$$ E(\mathbf{u},\mathbf{v}) = \sum_{i=1}^m L(g_2(g_1(\mathbf{x}_i,\mathbf{u}), \mathbf{v}), \mathbf{y}_i).$$

By the chain rule,

$$\frac{\partial E}{\partial v_j}(\mathbf{u},\mathbf{v}) = \sum_{i=1}^m \frac{\partial L}{\partial g_2}(\mathbf{u},\mathbf{v},\mathbf{x}_i,\mathbf{y}_i)\frac{\partial g_2}{\partial v_j}(\mathbf{u},\mathbf{v},\mathbf{x}_i)$$ $$\frac{\partial E}{\partial u_j}(\mathbf{u},\mathbf{v}) = \sum_{i=1}^m \frac{\partial L}{\partial g_2}(\mathbf{u},\mathbf{v},\mathbf{x}_i,\mathbf{y}_i)\frac{\partial g_2}{\partial g_1}(\mathbf{u},\mathbf{v},\mathbf{x}_i)\frac{\partial g_1}{\partial u_j}(\mathbf{u},\mathbf{x}_i)$$ Computing these derivatives is what backpropagation does, but backpropagation is a bit more than just the chain rule: it is an algorithmic way of implementing the chain rule (and, importantly, it is optimized for computational speed even when $m$ is very, very large). This blog post sheds some light on it: https://timvieira.github.io/blog/post/2017/08/18/backprop-is-not-just-the-chain-rule/
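To make this concrete, here is a small numpy sketch of the example above, where I choose $g_1(\mathbf{x},U) = \tanh(U\mathbf{x})$ and $g_2(\mathbf{h},V) = \tanh(V\mathbf{h})$ (so $\mathbf{u},\mathbf{v}$ are matrices $U,V$) and squared-error loss; these concrete choices are mine. Note how the backward pass computes the factor $\partial L/\partial g_2$ once and reuses it for both gradients; organizing such reuse, layer by layer from the output backwards, is what backpropagation does.

```python
import numpy as np

def backprop(U, V, xs, ys):
    """Gradients of E(U, V) = sum_i ||g2(g1(x_i, U), V) - y_i||^2,
    where g1(x, U) = tanh(U x) and g2(h, V) = tanh(V h)."""
    dU, dV = np.zeros_like(U), np.zeros_like(V)
    for x, y in zip(xs, ys):
        # forward pass: store the intermediate values
        h = np.tanh(U @ x)                      # g1(x, U)
        out = np.tanh(V @ h)                    # g2(h, V)
        # backward pass: chain rule, outermost factor first
        dL_dout = 2.0 * (out - y)               # dL/dg2
        dL_dz2 = dL_dout * (1.0 - out**2)       # through tanh'
        dV += np.outer(dL_dz2, h)               # contribution to dE/dV
        dL_dh = V.T @ dL_dz2                    # the dg2/dg1 factor, reused
        dL_dz1 = dL_dh * (1.0 - h**2)           # through tanh'
        dU += np.outer(dL_dz1, x)               # contribution to dE/dU
    return dU, dV
```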

Tuning gradient descent

There are several caveats to be aware of when working with gradient descent:

  • To answer your question in bold above: there is no guarantee that the function $E$ decreases with each step of gradient descent. Oftentimes it won't, especially near a local minimum.
  • In general, the loss functions used in state-of-the-art machine learning algorithms are non-convex, which means there is no theory guaranteeing that gradient descent converges to a global minimum. For applications this is acceptable, but getting close to the global minimum using gradient descent becomes a bit of an engineering problem.
  • Moreover, a priori, there is no guarantee that a global minimum even exists.
  • The factor $\varepsilon$ chosen above controls to some extent the magnitude of the steps that gradient descent takes. If it is very big, there is a chance your algorithm will approach the global minimum and then shoot past it. If it is too small, your convergence will be very slow, or you might get stuck in a region where the loss function is relatively flat (people call these plateaus). There are various sophisticated versions of gradient descent which mitigate some of these issues. A simple one is to decay the constant $\varepsilon$ as you iterate, so that near the minimum you don't overshoot. Others like "momentum" keep you moving near plateaus so you don't get stuck.
  • Weird things can happen in very deep neural networks that cause your gradients to explode or go to zero, which means that gradient descent will go wild, or stop. People put a lot of work into fiddling with their algorithms to avoid this.
  • It is probably unclear from what I wrote above what an ideal "stopping condition" for gradient descent is. How many steps should you take? Should you instead stop once the error falls below a cut-off threshold?
  • Above, I summed the loss function over all the examples in the training set. Versions of gradient descent called mini-batch gradient descent and stochastic gradient descent choose only a subset of the training set for computing $\nabla E$ at each iteration. This has advantages for computational time and convergence; see the sketch below.
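For instance, a sketch of mini-batch gradient descent (assuming a function `grad_L` that returns the gradient of the loss on a single example; the names here are my own):

```python
import random

def minibatch_sgd(grad_L, w, examples, eps=0.01, batch_size=32, epochs=10):
    """Each step uses the gradient of the loss summed over a random
    mini-batch instead of over the full training set."""
    for _ in range(epochs):
        random.shuffle(examples)                # fresh random batches each epoch
        for k in range(0, len(examples), batch_size):
            batch = examples[k:k + batch_size]
            g = sum(grad_L(w, x, y) for x, y in batch)
            w = w - eps * g
    return w
```

With `batch_size=1` this is stochastic gradient descent (essentially your algorithm 2); with `batch_size=len(examples)` it reduces to the full-batch version above.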
  • Ok, so according to my notation, you are saying that I should define $E := \sum^m_{i=1} \phi_i$ and apply gradient descent to (hope to) find a minimum of $E$. I thought it would have been computationally very slow. Do you have a reference for the mini-batch method? Commented Dec 28, 2017 at 15:37
  • It's actually not that slow, because computing $E$ and backprop is done using vectorization (basically, replace for-loops with one big matrix multiplication, which is well optimized in numpy), but mini-batch does indeed speed it up. If you want to get a better feel for the basics, I'd recommend watching videos from Coursera's deeplearning.ai. In particular, vectorization: coursera.org/learn/neural-networks-deep-learning/lecture/NYnog/… and mini-batch gradient descent: coursera.org/learn/deep-neural-network/lecture/qcogH/… Commented Dec 28, 2017 at 15:45
  • I am not a "feeling" person, and requested a reference "for mathematicians" because I don't understand material that is not formal enough. Commented Dec 28, 2017 at 16:07
  • You can certainly extract something formal and rigorous from those videos; it just requires a little work on your end. Commented Dec 28, 2017 at 16:14
  • Sure. And on the other hand, I don't want to pay for a video :p Commented Dec 28, 2017 at 16:19
