Conditional Random Fields Summary

The Big Picture

A CRF is defined as (ref: Machine Learning: A Probabilistic Perspective) $p(y|x,w) = \frac{1}{Z(x,w)}\prod_{c}\psi_c(y_c|x,w).$

Typically there are three things we’d like to do with CRFs as with HMMs, which are
evaluation: $$p(y|x,w)$$
learning: $$\text{arg}\max\limits_{w} p(y|x,w)$$ (MLE, MAP, or max-margin learning)
decoding: $$\text{arg}\max\limits_{y} p(y|x,w)$$ subject to $$p(y^{max}|x,w)=\max\limits_{y}p(y|x,w)$$ (i.e. to find the set of values $$y$$ that are individually the most probable)

Potentials (Energy Functions)

The potentials $$\psi$$ are usually defined in the log-linear form of energy functions as, $\psi(y|x,w)=\exp(-E(x,y))=\exp(w^T\phi(x,y))$ where $$w$$ are model parameters and $$\phi(x,y)$$ are called the features (or sufficient statistics functions).

For example, to construct a Gaussian like potential, we can define $$\phi(x,y)=(x-y)^2$$, $$w=-1/(2\sigma^2)$$, so $\psi(y|x,w)=\exp(-\frac{(x-y)^2}{2\sigma^2}).$

If the hidden variable $$y$$ is categorical, it often only serves as an indicator in feature functions. For example a simple chain structured CRF can be written as $p(y|x)=\frac{1}{Z(x)}\exp(\sum_{j=1}^{m}\langle w_{y_j},\phi_j(x_j)\rangle + \sum_{j=1}^{m-1}T_{y_j, y_{j+1}}).$

Moreover, the feature function $$\phi_c$$ can also be learned by say, a neural network. More information about feature functions and the log-linear family can be found in Chapter 8, Probabilistic Graphical Models.

Dynamic Programming

The difficulty often arises in the evaluating the normalizer $$Z(x)$$ and related gradients, for chain structured CRFs, there’s a way to evaluate $$Z(x)$$ and the corresponding gradient terms efficiently, which is known as the forward-backward algorithm. There are variant versions of the forward-backward algorithm in different contexts (e.g. for doing the E step of EM for HMMs) but the key concept is the same, which is by applying dynamic programming in both the forward and backward directions to compute the marginals.

Dynamic programming in this case can be simply thought of as the distributive rule reversed: $ab + ac = a(b + c)$ By taking the $$a$$ out of the parenthesis we are converting an exponential number of computations into polynomial. Dynamic programming can also be applied to the decoding task in a similar manner as the Viterbi algorithm for HMMs.

More

A bit more about the forward-backward algorithm for CRFs, a question on forward-backward algorithm in CRF++.
A bit more about the feature functions for the log-linear model, How can I implement a CRF feature function?, Log-linear models and conditional random fields.