# Backpropagation

### Backpropagation

At training time, a neural network can be seen as a function that takes a vector input and outputs a scalar loss, $$l=\mathrm{loss}(f(\mathbf{x}))$$. Backpropagation is the process of computing the derivatives by applying the chain rule, $\frac{\partial l}{\partial \mathbf{x}}=\frac{\partial l}{\partial \mathbf{f}}\frac{\partial \mathbf{f}}{\partial \mathbf{x}}$, where $$\mathbf{f}$$ denotes the network predictions, which are usually a vector.
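
As a concrete illustration, here is a minimal JAX sketch. The one-layer "network" `f`, the fixed weight matrix `W`, and the sum-of-squares loss are all hypothetical choices, not from the text; `jax.grad` applies exactly this chain rule end to end to produce $$\partial l/\partial \mathbf{x}$$.

```python
import jax
import jax.numpy as jnp

# Hypothetical one-layer "network" f and a toy scalar loss, for illustration only.
W = jnp.array([[1.0, -0.5],
               [0.3,  0.8]])

def f(x):
    return jnp.tanh(W @ x)        # network predictions, a vector

def loss(pred):
    return jnp.sum(pred ** 2)     # a toy scalar loss

x = jnp.array([0.2, -0.1])
# jax.grad applies the chain rule dl/dx = (dl/df)(df/dx) through the whole computation.
dl_dx = jax.grad(lambda x: loss(f(x)))(x)
print(dl_dx.shape)                # (2,), same shape as x
```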

### Jacobian matrix

The derivative of a scalar w.r.t. a vector is a vector of the same shape as that vector. The derivative of a vector w.r.t. a vector is a matrix known as the Jacobian matrix. So in the formula above, $$\partial l/\partial \mathbf{f}$$ is a vector and $$\partial \mathbf{f}/\partial \mathbf{x}$$ is a matrix, where $$(\partial \mathbf{f}/\partial \mathbf{x})_{ij}=\partial \mathbf{f}_i/\partial \mathbf{x}_j$$.
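
A quick sketch of these shapes using `jax.jacobian` (the particular map `f` below is an arbitrary example chosen for illustration):

```python
import jax
import jax.numpy as jnp

def f(x):
    # An arbitrary map from R^3 to R^2, so its Jacobian is a 2x3 matrix.
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
J = jax.jacobian(f)(x)
print(J.shape)   # (2, 3): J[i, j] = d f_i / d x_j
```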

### Chain Rule for Multivariable Functions

For multivariable functions, the chain rule amounts to a matrix product of Jacobians. Say the computation dependency between three variables is $$\mathbf{x} \rightarrow \mathbf{h} \rightarrow \mathbf{f}$$. By the multivariable chain rule, a single element of the Jacobian matrix $$\partial \mathbf{f}/\partial \mathbf{x}$$ is $(\frac{\partial \mathbf{f}}{\partial \mathbf{x}})_{ij}=\sum_k \frac{\partial \mathbf{f}_i}{\partial \mathbf{h}_k}\frac{\partial \mathbf{h}_k}{\partial \mathbf{x}_j}$, which is the inner product of the $$i$$th row of $$\partial \mathbf{f}/\partial \mathbf{h}$$ and the $$j$$th column of $$\partial \mathbf{h}/\partial \mathbf{x}$$. That is, by definition, the matrix product of the two Jacobians: $$\partial \mathbf{f}/\partial \mathbf{x} = (\partial \mathbf{f}/\partial \mathbf{h})(\partial \mathbf{h}/\partial \mathbf{x})$$.
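
This identity can be checked numerically. The sketch below, with arbitrary example choices of `h` and `f`, compares the Jacobian of the composition against the product of the two Jacobians:

```python
import jax
import jax.numpy as jnp

def h(x):
    return jnp.tanh(x)                                # elementwise, R^3 -> R^3

def f(hv):
    return jnp.array([hv[0] + hv[1], hv[1] * hv[2]])  # R^3 -> R^2

x = jnp.array([0.1, 0.2, 0.3])
J_fh = jax.jacobian(f)(h(x))                  # df/dh, shape (2, 3)
J_hx = jax.jacobian(h)(x)                     # dh/dx, shape (3, 3)
J_fx = jax.jacobian(lambda x: f(h(x)))(x)     # df/dx, shape (2, 3)
print(jnp.allclose(J_fx, J_fh @ J_hx))        # True: composition = Jacobian product
```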

### VJP

For a neural network layer with input $$\mathbf{x}$$ and output $$\mathbf{h}$$, the backpropagation step takes $$\partial l/\partial \mathbf{h}$$ as input and outputs $$\partial l/\partial \mathbf{x}$$. Since the loss $$l$$ is a scalar, both the input and the output of the layer's backpropagation function are vectors, and the output is computed by multiplying the input with the Jacobian matrix, $\frac{\partial l}{\partial \mathbf{x}}=\frac{\partial l}{\partial \mathbf{h}}\frac{\partial \mathbf{h}}{\partial \mathbf{x}}$, which is also known as the vector-Jacobian product (VJP). In practice, a layer's VJP can be evaluated without ever forming the Jacobian matrix explicitly.
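
JAX exposes this operation directly as `jax.vjp`. The sketch below uses a made-up toy `layer` and an invented upstream gradient `dl_dh` to show that the backward function returns exactly $$\partial l/\partial \mathbf{h}$$ times the Jacobian:

```python
import jax
import jax.numpy as jnp

def layer(x):
    # A made-up layer from R^3 to R^2, for illustration only.
    return jnp.array([x[0] * x[1], x[1] + x[2]])

x = jnp.array([1.0, 2.0, 3.0])
h, vjp_fn = jax.vjp(layer, x)     # forward pass + the layer's backward function

dl_dh = jnp.array([0.5, -1.0])    # pretend upstream gradient dl/dh from the next layer
(dl_dx,) = vjp_fn(dl_dh)          # VJP: dl/dx = (dl/dh) @ (dh/dx), Jacobian never formed
print(dl_dx)

# Sanity check against the explicit Jacobian product.
J = jax.jacobian(layer)(x)
print(jnp.allclose(dl_dx, dl_dh @ J))   # True
```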