Backpropagation

By Dontloo

Backpropagation

At training time, a neural network can be seen as a function that takes a vector input and outputs a scalar loss, \(l=loss(f(\mathbf{x}))\). Backpropagation is the process of computing the derivatives by applying the chain rule, \[\frac{\partial l}{\partial \mathbf{x}}=\frac{\partial l}{\partial \mathbf{f}}\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\] where \(\mathbf{f}\) denotes the network predictions, which are usually a vector.
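As a minimal sketch of this view (the tiny one-layer network and squared-error loss below are made up for illustration), we can compose a prediction function \(f\) with a scalar loss and let reverse-mode autodiff apply the chain rule to get \(\partial l/\partial \mathbf{x}\):

```python
import jax
import jax.numpy as jnp

def f(x):
    # a toy "network": one linear layer followed by tanh (illustrative only)
    W = jnp.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
    return jnp.tanh(W @ x)

def loss(preds):
    # scalar loss: sum of squared predictions
    return jnp.sum(preds ** 2)

x = jnp.array([0.1, 0.2, 0.3])
l = loss(f(x))                              # forward pass: scalar loss
dl_dx = jax.grad(lambda x: loss(f(x)))(x)   # backprop: dl/dx, same shape as x
print(l, dl_dx)
```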

Jacobian matrix

The derivative of a scalar w.r.t. a vector is a vector of the same shape as the input. The derivative of a vector w.r.t. a vector is a matrix known as the Jacobian matrix. So in the above formula, \(\partial l/\partial \mathbf{f}\) is a vector and \(\partial \mathbf{f}/\partial \mathbf{x}\) is a matrix, where \((\partial \mathbf{f}/\partial \mathbf{x})_{ij}=\partial \mathbf{f}_i/\partial \mathbf{x}_j\).
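A small sketch of the shapes involved (the two functions below are arbitrary toy examples): the gradient of a scalar-valued function has the same shape as its vector input, while the Jacobian of a vector-valued function is a matrix with one row per output component.

```python
import jax
import jax.numpy as jnp

def scalar_fn(x):
    return jnp.sum(x ** 2)               # R^3 -> R

def vector_fn(x):
    W = jnp.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
    return W @ x                          # R^3 -> R^2

x = jnp.array([0.1, 0.2, 0.3])
print(jax.grad(scalar_fn)(x).shape)       # (3,)  -- same shape as x
J = jax.jacobian(vector_fn)(x)
print(J.shape)                            # (2, 3); J[i, j] = d f_i / d x_j
```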

Chain Rule for Multivariable Functions

The chain rule for multivariable functions is equivalent to a matrix product of Jacobians. Say the computation dependency between three variables is \(\mathbf{x} \rightarrow \mathbf{h} \rightarrow \mathbf{f}\). By the multivariable chain rule, a single element of the Jacobian matrix \(\partial \mathbf{f}/\partial \mathbf{x}\) is \[\left(\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\right)_{ij}=\sum_k \frac{\partial \mathbf{f}_i}{\partial \mathbf{h}_k}\frac{\partial \mathbf{h}_k}{\partial \mathbf{x}_j}\] which is the inner product of the \(i\)th row of \(\partial \mathbf{f}/\partial \mathbf{h}\) and the \(j\)th column of \(\partial \mathbf{h}/\partial \mathbf{x}\). By definition, this means \(\partial \mathbf{f}/\partial \mathbf{x}\) is the product of the two Jacobian matrices.
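A quick numerical check of this identity (the functions \(\mathbf{h}\) and \(\mathbf{f}\) below are arbitrary toy examples): the Jacobian of the composition equals the matrix product of the individual Jacobians.

```python
import jax
import jax.numpy as jnp

def h(x):
    # R^3 -> R^3
    return jnp.array([x[0] * x[1], jnp.sin(x[2]), x[0] + x[2]])

def f(hv):
    # R^3 -> R^2
    return jnp.array([hv[0] ** 2 + hv[1], hv[2] * hv[1]])

x = jnp.array([0.5, -1.0, 2.0])
J_fx = jax.jacobian(lambda x: f(h(x)))(x)   # d f / d x, shape (2, 3)
J_fh = jax.jacobian(f)(h(x))                # d f / d h, shape (2, 3)
J_hx = jax.jacobian(h)(x)                   # d h / d x, shape (3, 3)
print(jnp.allclose(J_fx, J_fh @ J_hx))      # True
```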

VJP

For a neural network layer with input \(\mathbf{x}\) and output \(\mathbf{h}\), at the backpropagation step it takes \(\partial l/\partial \mathbf{h}\) as input and outputs \(\partial l/\partial \mathbf{x}\). Since the loss \(l\) is a scalar, both the input and the output of the layer's backpropagation function are vectors, and the output is computed by multiplying the input with the Jacobian matrix \[\frac{\partial l}{\partial \mathbf{x}}=\frac{\partial l}{\partial \mathbf{h}}\frac{\partial \mathbf{h}}{\partial \mathbf{x}}\] which is also known as the vector-Jacobian product (VJP).
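A sketch of the VJP view (the tanh layer below is a made-up example): during backpropagation the layer receives \(\partial l/\partial \mathbf{h}\) and returns \(\partial l/\partial \mathbf{x}\) without ever materializing the full Jacobian.

```python
import jax
import jax.numpy as jnp

def layer(x):
    # toy layer: linear map followed by tanh, R^3 -> R^2
    W = jnp.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])
    return jnp.tanh(W @ x)

x = jnp.array([0.1, 0.2, 0.3])
dl_dh = jnp.array([0.5, -1.5])          # incoming gradient from the layer above

h, vjp_fn = jax.vjp(layer, x)           # forward pass + VJP closure
(dl_dx,) = vjp_fn(dl_dh)                # vector-Jacobian product: dl/dx

# same result via the explicit Jacobian, just far less efficient
J = jax.jacobian(layer)(x)
print(jnp.allclose(dl_dx, dl_dh @ J))   # True
```

This is why frameworks only ask each operation to implement a VJP rather than a full Jacobian: the row vector \(\partial l/\partial \mathbf{h}\) can be pushed backward through each layer with a cheap matrix-vector product.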

Ref: PyTorch Autograd