I have been trying to learn about neural networks, and I don't understand the backpropagation algorithm. I will state what I have understood so far in a formal manner and point out where I get stuck. Please correct me if I'm wrong!
Let $S$ be a positive integer. Let $I$ be a subset of $[0,1]^S$ (the inputs) and $O$ be a finite totally ordered set (the outputs). Let $f : I \rightarrow O$ be a map (the one that we want the neural network to "learn").
Define $resp : [0,1]^O \rightarrow O$ by $\forall x = (x_o)_{o \in O}$, $resp(x) := \min\{o \in O \ \vert \ x_o = \max\{x_{o'} \ \vert \ o' \in O\}\}$ (which chooses the most activated output). If $o \in O$, let $\delta_{o}$ be the map sending $o$ to $1$ and all other $o'$ to $0$.
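To fix ideas, here is how I picture these two definitions in Python (identifying $O$ with $\{0,\dots,|O|-1\}$; this identification and the use of NumPy are my own choices):

```python
import numpy as np

def resp(x):
    # x: vector of activations indexed by the outputs 0, ..., |O|-1.
    # np.argmax returns the smallest index achieving the maximum,
    # which matches the min in the definition above.
    return int(np.argmax(x))

def delta(o, n_outputs):
    # The map delta_o written as a one-hot vector: 1 at position o, 0 elsewhere.
    d = np.zeros(n_outputs)
    d[o] = 1.0
    return d
```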
Let $\sigma : \mathbb{R} \rightarrow [0,1]$ be a sigmoid-like function (so that $\sigma \times \dots \times \sigma : \mathbb{R}^O \rightarrow [0,1]^O$ below denotes $\sigma$ applied componentwise).
The goal is to find $M \in L(\mathbb{R}^S,\mathbb{R}^O)$ such that for "most" (if possible, for all) $i \in I$, $f(i) = (resp \circ (\sigma \times \dots \times \sigma) \circ M)(i)$. Let $I' \subseteq I$ be a finite subset. The elements of $I'$ are called example inputs.
To that purpose, define $c : [0,1]^O \times [0,1]^O \rightarrow \mathbb{R}$ by $\forall (a,b) \in [0,1]^O \times [0,1]^O$, $c(a,b) = \sum_{o \in O} (a_o - b_o)^2$ (in general, we can take other $c$, but I restrict my attention to this function).
The goal is thus to find some $M$ minimizing, simultaneously for all example inputs $i \in I'$, the functions $\phi_i : M \mapsto c\big(((\sigma \times \dots \times \sigma) \circ M)(i),\ \delta_{f(i)}\big)$.
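Here is a small sketch of the objects involved, reusing `delta` from above and taking $\sigma$ to be the logistic function (again, these concrete choices are mine, and $M$ is represented as a $|O| \times S$ matrix):

```python
import numpy as np

def sigma(z):
    # The logistic function, one possible "sigmoid-like" choice.
    return 1.0 / (1.0 + np.exp(-z))

def c(a, b):
    # c(a, b) = sum over the outputs of (a_o - b_o)^2.
    return float(np.sum((a - b) ** 2))

def phi(M, i, o, n_outputs):
    # phi_i(M) = c( ((sigma x ... x sigma) o M)(i), delta_{f(i)} ),
    # where o is the index of f(i) and M is an |O| x S matrix.
    return c(sigma(M @ i), delta(o, n_outputs))
```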
If $\psi : \mathbb{R}^n \rightarrow \mathbb{R}$ is a differentiable function, and $\mu$ is a real number, and $x \in \mathbb{R}^n$, we call the action of replacing $x$ by $x - \mu \nabla_x \psi$ a step of gradient descent of rate $\mu$.
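Concretely, I picture one such step as follows. I use a crude numerical approximation of the gradient only to keep the sketch self-contained; if I understand correctly, computing this gradient exactly and cheaply is where backpropagation would come in, but that is part of what I am unsure about.

```python
import numpy as np

def numerical_gradient(psi, x, eps=1e-6):
    # Central-difference approximation of grad psi at x
    # (slow but simple: one coordinate of x is perturbed at a time).
    grad = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e.flat[k] = eps
        grad.flat[k] = (psi(x + e) - psi(x - e)) / (2.0 * eps)
    return grad

def gradient_descent_step(psi, x, mu):
    # One step of gradient descent of rate mu: replace x by x - mu * grad psi(x).
    return x - mu * numerical_gradient(psi, x)
```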
I will now describe a few algorithms. Some of them may fail to achieve the goal stated above. Can you tell me which of these work, and what their names are? I must say that, for the moment, I am not interested in questions of local/global minima.
1) « Optimizing after each example »
- Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
- Take $i_1 \in I'$. Replace $M_0$ by $M_1$, obtained by performing a lot of steps of gradient descent of rate $\mu$ for the function $M \mapsto c\big(((\sigma \times \dots \times \sigma) \circ M)(i_1),\ \delta_{f(i_1)}\big)$.
- Take $i_2 \in I'$ and repeat the last step with $i_2$ instead of $i_1$.
- Take $i_3 \in I'$ and repeat the last step with $i_3$ instead of $i_2$.
- …
- Take $i_q$ to be the last element of $I'$ and repeat the last step with $i_q$ instead of $i_{q-1}$.
I expect this algorithm not to do what we want: indeed, optimizing at step $2$ (for example $i_2$) might « de-optimize » what has been done at step $1$ (for example $i_1$).
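For reference, this is how I would write the first procedure, reusing `phi` and `gradient_descent_step` from the sketches above (the zero initialization of $M_0$ and the number of inner steps are arbitrary choices of mine):

```python
def algorithm_1(examples, S, n_outputs, mu=0.1, inner_steps=1000):
    # examples: a list of pairs (i, o), where i is a length-S float vector in
    # [0,1]^S and o is the index of f(i); this enumerates I' as i_1, ..., i_q.
    M = np.zeros((n_outputs, S))                     # M_0
    for i, o in examples:
        psi = lambda W: phi(W, i, o, n_outputs)      # cost for this single example
        for _ in range(inner_steps):                 # "a lot of" steps on i alone
            M = gradient_descent_step(psi, M, mu)
    return M
```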
2) « Taking one step after each example »
- Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
- Take $i_1 \in I'$. Replace $M_0$ by $M_1$, obtained by performing one step of gradient descent of rate $\mu$ for the function $M \mapsto c\big(((\sigma \times \dots \times \sigma) \circ M)(i_1),\ \delta_{f(i_1)}\big)$.
- Take $i_2 \in I'$ and repeat the last step with $i_2$ instead of $i_1$.
- Take $i_3 \in I'$ and repeat the last step with $i_3$ instead of $i_2$.
- …
- Take $i_q$ to be the last element of $I'$ and repeat the last step with $i_q$ instead of $i_{q-1}$.
I do not know if this algorithm does what we want.
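Written out with the same helpers, the only change from the previous sketch is that the inner loop is replaced by a single step:

```python
def algorithm_2(examples, S, n_outputs, mu=0.1):
    # One single gradient step per example, in the order i_1, ..., i_q.
    M = np.zeros((n_outputs, S))                     # M_0
    for i, o in examples:
        psi = lambda W: phi(W, i, o, n_outputs)
        M = gradient_descent_step(psi, M, mu)
    return M
```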
3) « Studying a whole bunch of examples »
- Start with $M_0 \in L(\mathbb{R}^S,\mathbb{R}^O)$.
- Perform a lot of steps of gradient descent of rate $\mu$ for the function $E := M \mapsto \sum_{i \in I'} c\big(((\sigma \times \dots \times \sigma) \circ M)(i),\ \delta_{f(i)}\big)$.
I expect this algorithm to do what we want, but if it does, I don’t know yet how to perform it in an efficient manner.
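This is how I would write the third procedure with the same helpers; note that my `numerical_gradient` makes each step extremely expensive (two evaluations of the total cost per entry of $M$), which is presumably where an exact, cheaper gradient computation would help:

```python
def algorithm_3(examples, S, n_outputs, mu=0.1, n_steps=1000):
    # Gradient descent on E(M) = sum over all example inputs of phi_i(M).
    M = np.zeros((n_outputs, S))                     # M_0
    E = lambda W: sum(phi(W, i, o, n_outputs) for i, o in examples)
    for _ in range(n_steps):                         # "a lot of" steps on the total cost
        M = gradient_descent_step(E, M, mu)
    return M
```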
I am also looking for references on neural networks written for mathematicians.