Softmax

Normalization functions

There are many ways to normalize a vector. This article focuses on functions of the following type: \[y=f(x) \text{ s.t. } y_i\geq 0, \sum_i y_i=1. \] Such normalization functions are widely used in machine learning, for example to represent discrete probability distributions.

How to construct a normalization function

A straightforward approach takes two steps:

Map input from real numbers to non-negative numbers (to ensure \(y_i\geq 0\))
Divide by the sum (to ensure \(\sum y_i=1\))
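
The two steps above can be sketched in Python as follows. The helper name `normalize` and the choice of absolute value or squaring as the non-negative map are illustrative assumptions, not something this article prescribes.

```python
def normalize(x, to_nonneg):
    """Hypothetical helper: map each entry to a non-negative number, then divide by the sum."""
    mapped = [to_nonneg(v) for v in x]      # step 1: ensure y_i >= 0
    total = sum(mapped)
    if total == 0:
        # Every entry mapped to zero, so dividing by the sum is undefined.
        raise ValueError("all mapped values are zero; cannot normalize")
    return [m / total for m in mapped]      # step 2: ensure sum(y) == 1

x = [-2.0, 0.5, 1.0]
y_abs = normalize(x, abs)                   # absolute value as the non-negative map
y_square = normalize(x, lambda v: v * v)    # squaring as the non-negative map
```

Both results are non-negative and sum to one, but they weight the entries differently, so the choice of map matters.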

Softmax

Softmax is a good example of this type:

The exponentiation \(e^x\) maps \(x\) from real numbers to positive numbers
Divide by \(\sum e^{x_i}\)

This gives Softmax: \[y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}.\]
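
As a minimal sketch, the formula can be written directly in Python with the standard `math` module. The function name `softmax` and the sample input are chosen for illustration only.

```python
import math

def softmax(x):
    """Direct transcription of y_i = exp(x_i) / sum_j exp(x_j)."""
    exps = [math.exp(v) for v in x]     # map real numbers to positive numbers
    total = sum(exps)                   # normalizing constant
    return [e / total for e in exps]    # divide by the sum

y = softmax([1.0, 2.0, 3.0])
```

This direct form is only suitable for moderate inputs, because `math.exp` raises `OverflowError` once its argument is large enough.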

Why the name Softmax

Suppose the max function returns a one-hot vector with a one at the index of the maximum value and zeros elsewhere, for example \[x=[-2,-1], \quad \max(x)=[0,1].\] Softmax returns a “softened” version in which the index of the maximum value still receives the largest output, but that output is smaller than one:
\[\text{softmax}(x)=[0.269, 0.731].\]
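
The comparison above can be reproduced with a short sketch. The name `hard_max` is a hypothetical stand-in for the one-hot max function described in the text, and ties are broken by taking the first maximum, which is an assumption.

```python
import math

def hard_max(x):
    """One-hot vector with a one at the first index of the maximum value (assumed tie rule)."""
    top = x.index(max(x))
    return [1.0 if i == top else 0.0 for i in range(len(x))]

def softmax(x):
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

x = [-2.0, -1.0]
hard = hard_max(x)
soft = [round(v, 3) for v in softmax(x)]
print(hard)   # [0.0, 1.0]
print(soft)   # [0.269, 0.731]
```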

Why Softmax

People use Softmax because it acts as the canonical link function for the cross-entropy loss (strictly speaking, it is the inverse of the canonical link), which in practice means the derivative of the two combined has a simple form (ref: Derivative of Softmax loss function):

\[L = -\sum_i t_i\log y_i = -\sum_i t_i\log\frac{e^{x_i}}{\sum_j e^{x_j}} = -\sum_i t_i \Big(x_i - \log\sum_j e^{x_j}\Big)\] \[\frac{\partial L}{\partial x_i} = y_i\sum_k t_k - t_i = y_i - t_i \quad \text{when } \sum_k t_k = 1.\]
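
One way to sanity-check the gradient is to compare it against finite differences. The sketch below assumes that \(L\) is the cross-entropy \(-\sum_i t_i \log y_i\) and that the targets \(t\) sum to one. All function names and sample values are illustrative.

```python
import math

def softmax(x):
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(x, t):
    """L = -sum_i t_i * log(softmax(x)_i)."""
    y = softmax(x)
    return -sum(ti * math.log(yi) for ti, yi in zip(t, y))

def numerical_grad(x, t, eps=1e-6):
    """Central finite differences of cross_entropy with respect to each x_i."""
    grad = []
    for i in range(len(x)):
        plus = x[:i] + [x[i] + eps] + x[i + 1:]
        minus = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((cross_entropy(plus, t) - cross_entropy(minus, t)) / (2 * eps))
    return grad

x = [0.5, -1.0, 2.0]
t = [0.0, 1.0, 0.0]                                      # one-hot target, sums to one
analytic = [yi - ti for yi, ti in zip(softmax(x), t)]    # y_i - t_i
numeric = numerical_grad(x, t)
```

The two gradients should agree up to the finite-difference error, which supports the closed form \(y_i - t_i\) under the stated assumptions.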

For better numerical stability, \(\log\sum_j e^{x_j}\) can be computed with the LogSumExp trick. Otherwise its value often has to be clipped to avoid numerical underflow or overflow, which leads to inaccurate derivatives.
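
A minimal sketch of the LogSumExp trick follows. It relies on the identity \(\log\sum_j e^{x_j} = m + \log\sum_j e^{x_j - m}\) with \(m = \max_j x_j\), so that every exponent is at most zero. The function name and test inputs are illustrative, and finite inputs are assumed.

```python
import math

def logsumexp(x):
    """Compute log(sum_j exp(x_j)) after shifting by the maximum (assumes finite inputs)."""
    m = max(x)
    # Every shifted exponent is <= 0, so exp() cannot overflow,
    # and the largest term is exp(0) = 1, so the sum cannot underflow to zero.
    return m + math.log(sum(math.exp(v - m) for v in x))

small = logsumexp([0.0, 1.0, 2.0])
large = logsumexp([1000.0, 1001.0, 1002.0])   # the naive math.exp(1000.0) would raise OverflowError
```

Shifting all inputs by a constant shifts the result by the same constant, so `large` is exactly 1000 more than `small` up to rounding.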

Therefore, in practice, some libraries implement Softmax as part of the loss function to get these benefits in both the forward and backward passes. The operation of computing \(x_i - \log\sum_j e^{x_j}\) is often called LogSoftmax.
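
The fused computation can be sketched as follows. This is a plain-Python illustration under the assumption that targets sum to one, not any particular library's implementation. The names `log_softmax` and `cross_entropy_from_logits` are hypothetical.

```python
import math

def log_softmax(x):
    """x_i - log(sum_j exp(x_j)), using the max-shift for stability."""
    m = max(x)
    lse = m + math.log(sum(math.exp(v - m) for v in x))
    return [v - lse for v in x]

def cross_entropy_from_logits(x, t):
    """Hypothetical fused loss: -sum_i t_i * log_softmax(x)_i, with no explicit Softmax."""
    return -sum(ti * li for ti, li in zip(t, log_softmax(x)))

x = [1000.0, 1001.0, 1002.0]   # inputs that would overflow a naive Softmax
t = [0.0, 0.0, 1.0]            # one-hot target
loss = cross_entropy_from_logits(x, t)
log_probs = log_softmax(x)
```

Because the loss works on log-probabilities directly, it never forms the probabilities \(y_i\) and then takes their logarithm, which is where the naive version loses precision.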