<h1>Dontloo’s Blog</h1>
<h1>Multiclass Classification and Multiple Binary Classifiers</h1>
<p>2019-01-04 · <a href="http://dontloo.github.io/blog/multiclass-binary">http://dontloo.github.io/blog/multiclass-binary</a></p>
<p>Should we always use hierarchical softmax instead of softmax?</p>
<p>Pros</p>
<ol>
<li>Each update only affects the nodes on the path from the predicted leaf to the root, so it’s more efficient</li>
<li>It’s usually better on infrequent categories when the training set is imbalanced (see <a href="https://stats.stackexchange.com/q/180076/95569">this question</a>)</li>
</ol>
<p>Cons</p>
<p>A single root node faces an XOR-style problem: with a linear (logistic) classifier at each node, the root must separate all categories assigned to its left subtree from those assigned to its right subtree, and such a split may not be linearly separable.</p>
<h1>Atomic Operations and Randomness</h1>
<p>2018-12-12 · <a href="http://dontloo.github.io/blog/atomic-add">http://dontloo.github.io/blog/atomic-add</a></p>
<p>If we want to compute the histogram of an array in parallel, will it return the same result every time it’s called?
It actually depends on how the parallel histogram program is implemented. Let’s take a look at two possible implementations.</p>
<h3 id="atomics">atomics</h3>
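<p>The body of this section is missing, so here is only a rough Python sketch of the atomic approach (CPU threads with a lock standing in for a hardware atomic add such as CUDA’s <code>atomicAdd</code>; the function name and chunking scheme are my own):</p>

```python
import threading

def histogram_atomic(data, num_bins, num_threads=4):
    """Parallel histogram over values in [0, 1): every increment is made
    atomic with a lock (a stand-in for a hardware atomicAdd)."""
    hist = [0] * num_bins
    lock = threading.Lock()

    def worker(chunk):
        for x in chunk:
            b = min(int(x * num_bins), num_bins - 1)  # bin index of x
            with lock:  # atomic read-modify-write on the shared bin
                hist[b] += 1

    chunks = [data[i::num_threads] for i in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return hist
```

<p>Because integer addition is associative and commutative, this version returns the same counts no matter how the thread updates interleave; an atomic <em>floating-point</em> accumulation would not have that guarantee, since float addition is not associative (e.g. <code>(0.1 + 0.2) + 0.3 != 0.1 + (0.2 + 0.3)</code>).</p>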
<h3 id="reduction">reduction</h3>
<h1>All (I Know) about Markov Chains</h1>
<p>2018-06-14 · <a href="http://dontloo.github.io/blog/mcmc">http://dontloo.github.io/blog/mcmc</a></p>
<h2 id="markov-chains">Markov Chains</h2>
<p>By Markov chains we refer to <a href="https://en.wikipedia.org/wiki/Markov_chain#Discrete-time_Markov_chain">discrete-time homogeneous Markov chains</a> in this blog.</p>
<p>A <strong>discrete-time Markov chain</strong> is a sequence of random variables \({X_1, X_2, X_3, …}\) with the Markov property, namely that the probability of moving to the next state depends only on the present state and not on the previous states
\[p(X_{n+1}|X_n,…,X_1) = p(X_{n+1}|X_n).\]</p>
<p>A <strong>homogeneous Markov chain</strong> is one whose dynamics do not change over time, that is, its transition probabilities are independent of the time step \(n\)
\[p(X_{n+1}|X_n) \text{ is the same } \forall n \geq 1.\]</p>
<h3 id="water-lily-example">Water Lily Example</h3>
<p>To illustrate we’ll take a look at the water lily example from this <a href="https://www.coursera.org/learn/bayesian-methods-in-machine-learning">Bayesian Methods for Machine Learning course</a> on coursera.<br />
<img src="https://raw.githubusercontent.com/dontloo/dontloo.github.io/master/images/lily.png" alt="lily" /><br />
There is a frog in a pond that jumps between two water lilies according to some probabilities. As shown in the graph, if the frog is at the left lily at time \(n\), it has probability 0.7 of jumping to the right lily and probability 0.3 of staying until time \(n+1\); if the frog is at the right lily, the probabilities of staying and jumping are even (0.5 each).</p>
<p>This process can be modeled as a Markov chain, let \(X_n\in\{L, R\}\) be the state of the frog at time \(n\), we have the transition probabilities
\[p(X_{n+1}=L|X_n=L)=0.3, p(X_{n+1}=R|X_n=L)=0.7\]
\[p(X_{n+1}=L|X_n=R)=0.5, p(X_{n+1}=R|X_n=R)=0.5.\]</p>
<p>If we let the frog jump as described for long enough, it turns out the distribution over the frog’s state converges to a fixed distribution \(\pi\) regardless of the starting state, specifically \(\pi(X=L)=5/12\approx 0.417\) and \(\pi(X=R)=7/12\approx 0.583\). This is an interesting property of Markov chains, and we’ll spend the rest of this blog discussing the relevant concepts and applications.</p>
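<p>As a quick numerical check (a minimal Python sketch, not part of the original post), iterating the transition update from a deterministic start converges to the stationary distribution:</p>

```python
# Transition matrix, rows indexed by the current state:
# P[i][j] = p(next state = j | current state = i), states 0 = L, 1 = R.
P = [[0.3, 0.7],   # from L: stay w.p. 0.3, jump w.p. 0.7
     [0.5, 0.5]]   # from R: jump w.p. 0.5, stay w.p. 0.5

def step(pi, P):
    """One step of the chain: new_pi[j] = sum_i pi[i] * P[i][j]."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

pi = [1.0, 0.0]  # the frog starts at the left lily
for _ in range(50):
    pi = step(pi, P)
# pi is now extremely close to (5/12, 7/12), and the same happens
# for any other starting distribution
```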
<h2 id="stationary-distribution">Stationary Distribution</h2>
<p>For a Markov chain, if a distribution \(\pi\) remains unchanged through time it is called a stationary distribution; formally it satisfies
\[\pi(X_{n+1}=i)=\sum_j\pi(X_{n}=j)p(X_{n+1}=i|X_{n}=j).\]</p>
<p>If there’s a finite number of possible states in the sample space, this condition can be put in matrix form as
\[\pi=\pi P\]
where \(P_{ij}=p(X_{n+1}=j|X_{n}=i)\) and \(\pi\) is a row vector. It might look similar to the definition of eigenvalues and eigenvectors
\[Av=\lambda v\]
In fact \(\pi\) is an eigenvector of \(P^T\) (a.k.a a left eigenvector of \(P\)) with eigenvalue \(\lambda=1\)
\[(P^T-I)\pi=0\]
In other words, \(\pi\) lies in the null space of \(P^T-I\). So to find the stationary distribution of a finite state Markov chain, we only need to solve the above system of linear equations.</p>
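<p>For small finite chains this system can be solved directly. A minimal Python sketch (my own, assuming the chain has a unique stationary distribution so the system has a unique solution): build \(P^T-I\), replace one redundant equation with the normalization \(\sum_i \pi_i = 1\), and solve by Gaussian elimination.</p>

```python
def stationary(P):
    """Stationary distribution of a row-stochastic matrix P: solve
    (P^T - I) pi = 0 with one equation replaced by sum(pi) = 1."""
    n = len(P)
    # A = P^T - I
    A = [[P[j][i] - (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    b = [0.0] * n
    A[n - 1] = [1.0] * n  # normalization constraint sum(pi) = 1
    b[n - 1] = 1.0
    # Gaussian elimination with partial pivoting
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    pi = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(A[r][c] * pi[c] for c in range(r + 1, n))
        pi[r] = (b[r] - s) / A[r][r]
    return pi
```

<p>For the frog chain this returns \((5/12, 7/12)\), matching the power-iteration result.</p>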
<p>One commonly raised question here is: in what cases is there one and only one stationary distribution (one and only one eigenvector with \(\lambda=1\))?</p>
<h3 id="closed-communicating-classes-irreducible-closed-subsets">Closed Communicating Classes (Irreducible Closed Subsets)</h3>
<p>Following <a href="https://en.wikipedia.org/wiki/Markov_chain#Reducibility">the definition on Wikipedia</a></p>
<ul>
<li>A state \(j\) is said to be accessible from a state \(i\) if a system started in state \(i\) has a non-zero probability of transitioning into state \(j\) at some point.</li>
<li>A state \(i\) is said to communicate with state \(j\) if both \(i\) is accessible from \(j\) and vice versa. A communicating class is a maximal set of states \(C\) such that every pair of states in \(C\) communicates with each other.</li>
<li>A communicating class is closed if the probability of leaving the class is zero.
Having defined the notion of a closed communicating class, now we can use it to find out the number of stationary distributions.</li>
</ul>
<p><strong>For finite state Markov chains</strong>, it <a href="https://en.wikipedia.org/wiki/Stochastic_matrix#Definition_and_properties">can be shown</a> that every transition matrix has at least one eigenvector associated with the eigenvalue 1, and that 1 is also the largest absolute value among its eigenvalues. Hence every finite state Markov chain has at least one stationary distribution.</p>
<p>If there’s exactly one closed communicating class in the chain, the chain has exactly one stationary distribution. For example, in the transition matrix
\[
\begin{matrix}
0.5 & 0.5 \\
0 & 1 \\
\end{matrix}
\]
there’s one closed communicating class consisting of the single absorbing state, \(C=\{2\}\), and \(\pi=[0, 1]\) is the unique stationary vector. If there’s more than one closed communicating class, then there are infinitely many stationary distributions. For example, for the identity matrix
\[
\begin{matrix}
1 & 0 \\
0 & 1 \\
\end{matrix}
\]
each of the two states is itself a closed communicating class, and in this case any probability vector is stationary, \(\pi I=\pi \).</p>
<p><strong>For infinite state Markov chains</strong>, the states in closed classes have to be <a href="https://math.stackexchange.com/questions/152991/how-can-i-spot-positive-recurrence"><strong>positive recurrent</strong></a> for a stationary distribution to exist (for finite chains positive recurrence within closed classes comes for free); otherwise there are no stationary distributions.</p>
<p>Here are <a href="http://wwwf.imperial.ac.uk/~ejm/M3S4/NOTES3.pdf">lecture notes from M3S4/M4S4 Applied Probability (Imperial College)</a>, <a href="http://www.columbia.edu/~ks20/stochastic-I/stochastic-I-MCII.pdf">tutorial by Karl Sigman</a> and <a href="https://math.stackexchange.com/q/2954886/291503">this question</a> for reference.</p>
<h3 id="irreducibility">Irreducibility</h3>
<p>If the entire state space forms one closed communicating class, then the chain is called irreducible.</p>
<h2 id="pagerank">PageRank</h2>
<p>The famous PageRank algorithm is an application of the Markov chain stationary distribution. Each web page can be thought of as a state, links to other pages represent the transition probabilities to other states, and each link on a page is considered equally likely to be followed.</p>
<p>Having the states and transition probabilities defined, the process of a person randomly clicking on links can be modeled by such Markov chain. The PageRank algorithm ranks pages according to the stationary probability, which can be interpreted as the probability of arriving at a page through random clicking.</p>
<p>As one can imagine, the transition matrix can be very sparse, since the number of links on a page is far less than the total number of pages. To ensure that there’s a unique stationary distribution, we can add a teleportation probability to the matrix to make all entries strictly positive, as described <a href="https://www.youtube.com/watch?v=Q-pCzTpwPBU">here (Lecture 33: Markov Chains Continued Further, Statistics 110)</a> and <a href="http://www.ams.org/publicoutreach/feature-column/fcarc-pagerank">here (How Google Finds Your Needle in the Web’s Haystack)</a>. A stochastic matrix with strictly positive entries is clearly <strong>irreducible</strong>, hence there’s a unique stationary distribution.</p>
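<p>As a toy illustration (a hedged Python sketch of the idea described above, certainly not Google’s implementation), power iteration with a damping factor \(d\) and uniform teleportation:</p>

```python
def pagerank(links, d=0.85, iters=100):
    """links[i] = pages that page i links to. With probability d follow a
    uniformly chosen out-link, with probability 1 - d teleport to a
    uniformly random page; returns the stationary distribution."""
    n = len(links)
    pi = [1.0 / n] * n
    for _ in range(iters):
        new = [(1.0 - d) / n] * n       # teleportation mass
        for i, outs in enumerate(links):
            if outs:
                share = d * pi[i] / len(outs)
                for j in outs:
                    new[j] += share
            else:  # dangling page: treat it as linking to every page
                for j in range(n):
                    new[j] += d * pi[i] / n
        pi = new
    return pi
```

<p>For a three-page web <code>links = [[1, 2], [2], [0]]</code> this converges to roughly \((0.388, 0.215, 0.397)\): the page that everyone links to ranks highest.</p>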
<h2 id="limiting-distribution">Limiting Distribution</h2>
<p>Now we know a Markov chain has a unique stationary distribution \(\pi\) when there’s exactly one closed positive recurrent communicating class; one may then wonder how a Markov chain behaves as \(n\to\infty\).</p>
<p>There are two limiting behaviours people usually care about. The first is whether the sample average converges to the stationary distribution as \(n\to\infty\)
\[\lim_{n\to\infty} \frac{1}{n} \sum^{n}_{m=1} I\{X_m=i\}=\pi_i \text{ w.p.1}\]
where \(X_m \sim p(X_m|X_0)\) are samples from the Markov chain (ref: <a href="http://www.columbia.edu/~ks20/stochastic-I/stochastic-I-MCII.pdf">this tutorial by Karl Sigman</a>).</p>
<p>Second, if the state distribution converges to the stationary distribution for any initial distribution of \(X_0\), then \(\pi\) is called the <a href="https://www.probabilitycourse.com/chapter11/11_2_6_stationary_and_limiting_distributions.php">limiting distribution</a>
\[\pi = \lim_{n\to\infty} p(X_n|X_0). \]
Clearly, a limiting distribution is always a stationary distribution and a Markov chain can have only one limiting distribution.</p>
<p>Let’s illustrate the difference between these two behaviours by an example. The matrix
\[
\begin{matrix}
0 & 1 \\
1 & 0 \\
\end{matrix}
\]
has only one eigenvector \(\pi=[0.5, 0.5]\) associated with the eigenvalue 1. Say the frog starts at the left lily; in the next step it will jump to the right lily w.p.1 according to the matrix, after another step it will jump back to the left, and so forth, like <a href="https://www.youtube.com/watch?v=d_-j9uuaDOQ">this video</a> shows :))</p>
<p>The chain alternates between the two states periodically and never reaches a stationary distribution. Nevertheless, if we draw two consecutive samples at a time, they can be treated as samples from the stationary distribution, because the average of the two alternating distributions of this chain is exactly the stationary distribution; hence the sample average still converges to the expectation w.r.t. the stationary distribution.</p>
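<p>Numerically (a small Python check, not from the original post): the state distribution of this chain keeps oscillating, while its running average converges to \([0.5, 0.5]\).</p>

```python
P = [[0.0, 1.0],  # the periodic two-state chain: always jump
     [1.0, 0.0]]
pi = [1.0, 0.0]   # start at the left state
avg = [0.0, 0.0]
steps = 1000
for _ in range(steps):
    avg = [a + p for a, p in zip(avg, pi)]
    pi = [pi[0] * P[0][0] + pi[1] * P[1][0],
          pi[0] * P[0][1] + pi[1] * P[1][1]]
avg = [a / steps for a in avg]
# pi itself never settles (it alternates between [1,0] and [0,1]),
# but the running average avg converges to the stationary [0.5, 0.5]
```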
<p>It turns out that convergence without averaging is a stronger condition than convergence with averaging: to go from the averaged kind to the un-averaged kind, a further condition known as aperiodicity is needed. Discussion about how to check the sample-average property can be found <a href="https://math.stackexchange.com/questions/2816677/what-is-the-necessary-and-sufficient-condition-of-markov-chain-sample-average-co">here</a>; a commonly used sufficient condition is <strong>irreducible + positive recurrent</strong>.</p>
<h3 id="ergodicity">Ergodicity</h3>
<p>Another commonly mentioned property is called ergodicity, which is essentially having a limiting distribution while being irreducible, more specifically,</p>
<ul>
<li>for a finite MC it holds that, <strong>aperiodic + irreducible (which implies positive recurrent) ⇔ ergodic</strong></li>
<li>for an infinite MC it holds that, <strong>aperiodic + irreducible + positive recurrent ⇔ ergodic</strong><br />
for more detail on these different properties you can take a look at <a href="https://www.youtube.com/watch?v=ZjrJpkD3o1w">18.2</a>, <a href="https://www.youtube.com/watch?v=tByUQbJdt14">18.3</a>, <a href="https://www.youtube.com/watch?v=Pce7KKeUf5w">18.4</a> and <a href="https://www.youtube.com/watch?v=daY4lgEyEPc">18.5</a> of the ML video series by mathematicalmonk, and <a href="http://www.statslab.cam.ac.uk/~yms/M7_2.pdf">this tutorial by Yuri Suhov</a>.</li>
</ul>
<h3 id="kl-divergence">KL Divergence</h3>
<p>It is shown in the book Elements of Information Theory 2nd (Section 4.4) that for any state distributions \(p\) and \(q\), a Markov chain will bring them closer together (at least no further away) step after step in terms of KL divergence
\[D_{KL}(p_n|q_n)\geq D_{KL}(p_{n+1}|q_{n+1})\]
If \(q\) is a stationary distribution, we have
\[D_{KL}(p_n|\pi)\geq D_{KL}(p_{n+1}|\pi)\]
which implies that any state distribution gets no further away from any stationary distribution as time passes. The sequence \(D_{KL}(p_n|\pi)\) is monotonically non-increasing and non-negative, and must therefore have a limit; when \(\pi\) is the limiting distribution of the chain, that limit is zero.</p>
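<p>We can check this numerically for the frog chain from the beginning of the post (a quick Python sketch; \((5/12, 7/12)\) is its stationary distribution):</p>

```python
import math

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [[0.3, 0.7], [0.5, 0.5]]   # the frog chain, rows = current state
target = [5 / 12, 7 / 12]      # its stationary distribution
p = [1.0, 0.0]
divs = []
for _ in range(20):
    divs.append(kl(p, target))
    p = [p[0] * P[0][0] + p[1] * P[1][0],
         p[0] * P[0][1] + p[1] * P[1][1]]
# divs is monotonically non-increasing and tends to zero
```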
<p>In the previous periodic example the state distribution alternates between \([1,0]\) and \([0,1]\); the KL divergence \(D_{KL}(p_n|\pi)\), though non-increasing, remains \(\log 2\) forever.</p>
<h3 id="the-second-eigenvalue">The Second Eigenvalue</h3>
<p>We know a Markov chain will converge to its limiting distribution from any state if one exists; the next question is how fast it converges. For finite state Markov chains, the convergence speed depends on the second-largest (in magnitude) eigenvalue of the transition matrix.</p>
<p>A sloppy sketch of the proof goes as follows. Assume for simplicity that all eigenvalues of the transition matrix \(P\) are real and distinct. Then it can be decomposed as \(P = V\Lambda V^{-1}\), where \(\Lambda\) is a diagonal matrix of eigenvalues, the columns of \(V\) are right eigenvectors and the rows of \(V^{-1}\) are left eigenvectors, so \(P^n =V\Lambda^n V^{-1} \) (ref: <a href="https://www.youtube.com/watch?v=U8R54zOTVLw">Diagonalizing a Matrix</a>). Decompose the initial distribution as a linear combination of the left eigenvectors, \(X_0=cV^{-1}\), where \(c\) is a row vector of coefficients. The state distribution at time \(n\) is then<br />
\[X_n = X_0P^n =cV^{-1}V\Lambda^nV^{-1} = c\Lambda^nV^{-1}.\]
Since the largest eigenvalue in \(\Lambda\) is one and the others have magnitude smaller than \(1\), the remaining entries of \(\Lambda^n\) vanish as \(n\to\infty\). So \(X_n\) converges to the stationary eigenvector \(\pi\) (the corresponding row of \(V^{-1}\)), and the difference between \(X_n\) and \(\pi\) shrinks in proportion to \(|\lambda_2|^n\), where \(\lambda_2\) is the second-largest eigenvalue in magnitude.</p>
<p>For a detailed proof please see <a href="http://www.tcs.hut.fi/Studies/T-79.250/tekstit/lecnotes_02.pdf">Markov Chains and Random Walks on Graphs</a>, and here is a nice animated illustration: <a href="https://www.youtube.com/watch?v=D8DZjLPlWd0&index=2&list=PLaNkJORnlhZmfwQITRbxXCzot3PSXlMYb">exponential convergence and the 3x3 pebble game (École normale supérieure)</a>.</p>
<h2 id="monte-carlo-sampling">Monte Carlo (Sampling)</h2>
<p>Monte Carlo methods were central to the simulations required for the Manhattan Project during World War II. The general idea is to estimate expected values using samples, especially when the exact values are hard to compute analytically
\[E_p[f(X)]=\int f(X)p(X) dX \approx \frac{1}{M}\sum^M_{m=1}f(X_m)\]
where \(\{X_m\}\) are samples from \(p(X)\).</p>
<p>The core problems are how to draw samples from a given distribution and what properties the samples should possess. Usually we want independent samples, though sometimes correlated samples can be more efficient and more accurate for estimating expected values. We’ll illustrate the difference with the following example.</p>
<h3 id="sampling-wheel-low-variance-resampling">Sampling Wheel (Low Variance Resampling)</h3>
<p>For a detailed introduction to the sampling wheel (low variance resampling) algorithm please see <a href="http://www.cs.cmu.edu/~16831-f14/notes/F11/16831_lecture04_tianyul.pdf">Particle Filters: The Good, The Bad, The Ugly</a>; we only discuss a simplified version here.</p>
<p>tbc</p>
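<p>Since the section is unfinished, here is only a hedged Python sketch of the standard systematic (“sampling wheel”) resampling algorithm, which uses a single uniform draw per sweep; the function name and interface are my own:</p>

```python
import random

def low_variance_resample(weights, m, rng=random):
    """Systematic ('sampling wheel') resampling: draw ONE uniform offset,
    then take m equally spaced pointers around the wheel of cumulative
    weights; returns the list of m selected particle indices."""
    total = sum(weights)
    step = total / m
    r = rng.uniform(0.0, step)    # the single random number used
    out, idx, cum = [], 0, weights[0]
    for k in range(m):
        u = r + k * step          # k-th equally spaced pointer
        while u > cum:            # advance to the particle covering u
            idx += 1
            cum += weights[idx]
        out.append(idx)
    return out
```

<p>Because the \(m\) pointers are equally spaced, the number of copies of particle \(i\) differs from \(m w_i/\sum_j w_j\) by at most one, which is the variance reduction compared with drawing \(m\) independent samples from the same weights.</p>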
<h3 id="mcmc">MCMC</h3>
<p>Drawing samples from an arbitrary distribution is not easy, especially in high dimensional spaces. In this section we’re going to see how to use Markov chains for Monte Carlo estimation, a.k.a Markov chain Monte Carlo (MCMC).</p>
<p>The central idea of MCMC is to construct a Markov chain with a unique stationary distribution being the target distribution \(\pi\). Then based on the limiting behaviour of Markov chains
\[\lim_{n\to\infty} \frac{1}{n} \sum_{m=1}^{n} I\{X_m=i\}=\pi_i \text{ w.p.1}\]<br />
it can be derived
\[E_{\pi}[f(X)] \approx \frac{1}{N}\sum^N_{i=1}f(X_i)\]
for any bounded function \(f\).</p>
<p>Further, if \(\pi\) is a limiting distribution,
\[\pi = \lim_{n\to\infty} P(X_n|X_0) \]
then, starting from an arbitrary state \(X_0\), the state the chain is in after it has converged can be seen as a sample from \(\pi\). Based on these two properties, there are many strategies out there for estimating expected values and drawing (independent) samples, which are not to be discussed in this blog.</p>
<p>We’ve already discussed how to check the existence of a unique stationary distribution and a limiting distribution. Now let’s see how to construct a Markov chain with a given stationary distribution \(\pi\), which at the same time should be easier to draw samples from.</p>
<h3 id="gibbs-sampling">Gibbs Sampling</h3>
<p>For some high dimensional distributions, sampling from the one-dimensional conditional distributions is easy; the Gibbs sampling method makes use of this property by sampling iteratively over each dimension.</p>
<p>Suppose we have a three-dimensional distribution \(p(X1, X2, X3)\), and we’re in state \((X1_{n}, X2_{n}, X3_{n})\) at time \(n\); we get to the next state following the strategy
\[X1_{n+1}\sim p(X1|X2_n, X3_n) \]
\[X2_{n+1}\sim p(X2|X1_{n+1}, X3_n) \]
\[X3_{n+1}\sim p(X3|X1_{n+1}, X2_{n+1}). \]
The transition probability is then
\[p(X1_{n+1}, X2_{n+1}, X3_{n+1}|X1_n, X2_n, X3_n) = p(X1_{n+1}|X2_n,X3_n) p(X2_{n+1}|X1_{n+1},X3_n) p(X3_{n+1}|X1_{n+1},X2_{n+1}). \]
It’s easy to show that \(p(X1, X2, X3)\) is a stationary distribution of the Markov chain constructed (see <a href="https://www.coursera.org/lecture/bayesian-methods-in-machine-learning/gibbs-sampling-eZBy5">this lecture</a>).</p>
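<p>A minimal Python sketch of the idea on a toy two-dimensional binary distribution (my own example, not from the lecture):</p>

```python
import random

def gibbs_2d(p, steps, seed=0):
    """Gibbs sampling for a two-dimensional binary distribution given as
    a joint table p[x1][x2]; each step resamples one coordinate from its
    one-dimensional conditional given the other."""
    rng = random.Random(seed)
    x1, x2 = 0, 0
    samples = []
    for _ in range(steps):
        # x1 ~ p(x1 | x2), proportional to the joint p(x1, x2)
        w0, w1 = p[0][x2], p[1][x2]
        x1 = 0 if rng.random() < w0 / (w0 + w1) else 1
        # x2 ~ p(x2 | x1), proportional to the joint p(x1, x2)
        w0, w1 = p[x1][0], p[x1][1]
        x2 = 0 if rng.random() < w0 / (w0 + w1) else 1
        samples.append((x1, x2))
    return samples
```

<p>With the correlated joint <code>[[0.4, 0.1], [0.1, 0.4]]</code>, the empirical frequency of each state converges to its joint probability, even though we only ever sampled one-dimensional conditionals.</p>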
<h3 id="metropolis-hastings">Metropolis Hastings</h3>
<p><strong>Detailed Balance</strong>
\[\pi(X_{n}=i)p(X_{n+1}=j|X_{n}=i)=\pi(X_{n}=j)p(X_{n+1}=i|X_{n}=j) \forall i,j \]
is a sufficient but not necessary condition for a distribution to be stationary.
Still, sampling directly from \(\pi\) is hard. Unlike Gibbs sampling, where the transition probability is separated into one-dimensional conditional probabilities, the Metropolis Hastings algorithm decomposes the transition probability into a proposal distribution \(q\) (easy to sample from) and an acceptance factor \(\alpha\) (to satisfy detailed balance). Say at time \(n\) we’re in state \(X_n\); we draw a sample \(x'\) from the proposal distribution \(q(X_{n+1}|X_{n})\) and accept it with probability \(\alpha(X_{n+1}, X_{n})\); if accepted, we move to the sampled state, otherwise we remain in the same state. Thus the transition probability becomes
\[p(X_{n+1}=j|X_{n}=i) = q(X_{n+1}=j|X_{n}=i)\alpha(X_{n+1}=j,X_{n}=i) \text{ for } i\neq j\]
\[p(X_{n+1}=i|X_{n}=i) = q(X_{n+1}=i|X_{n}=i)\alpha(X_{n+1}=i,X_{n}=i) + \sum_{j}q(X_{n+1}=j|X_{n}=i)(1-\alpha(X_{n+1}=j,X_{n}=i))\]</p>
<p>The acceptance probability \(\alpha\) needs to satisfy
\[\frac{\alpha(X_{n+1}=j,X_{n}=i)}{\alpha(X_{n+1}=i,X_{n}=j)}=\frac{\pi(X_{n}=j)q(X_{n+1}=i|X_{n}=j)}{\pi(X_{n}=i)q(X_{n+1}=j|X_{n}=i)}.\]
Meanwhile its value has to be between 0 and 1; one solution is
\[\alpha(X_{n+1}=j,X_{n}=i)=\min (1, \frac{\pi(X_{n}=j)q(X_{n+1}=i|X_{n}=j)}{\pi(X_{n}=i)q(X_{n+1}=j|X_{n}=i)} ).\]
Since the target distribution \(\pi\) only appears as a ratio, \(\pi\) may be unnormalized. This is very helpful when the normalizing constant is hard to compute.</p>
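<p>To make the recipe concrete, here is a hedged Python sketch on a finite state space with a uniform (hence symmetric) proposal, so that \(q\) cancels in the acceptance ratio and the target may be unnormalized:</p>

```python
import random

def metropolis_hastings(target, n_states, steps, seed=0):
    """Metropolis-Hastings on {0, ..., n_states-1} with a uniform,
    symmetric proposal; `target` may be an unnormalized distribution."""
    rng = random.Random(seed)
    x = rng.randrange(n_states)
    samples = []
    for _ in range(steps):
        y = rng.randrange(n_states)               # propose
        alpha = min(1.0, target(y) / target(x))   # acceptance probability
        if rng.random() < alpha:
            x = y                                 # accept; else stay put
        samples.append(x)
    return samples
```

<p>For the unnormalized target \((1, 2, 3, 4)\), the empirical frequencies approach \((0.1, 0.2, 0.3, 0.4)\) without ever computing the normalizer.</p>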
<p>It may seem a bit weird that, because of the acceptance step, there can be a large probability of remaining in the same state at every move. Self loops do affect the speed of convergence, but they won’t change the stationary distribution as long as detailed balance is satisfied; it’s analogous to the fact that adding a constant to all diagonal elements of a matrix doesn’t change the eigenvectors.</p>
<h2 id="discussions">Discussions</h2>
<p>Markov Chain Stationary Distribution Problem</p>
<h1>Summary of Latent Dirichlet Allocation</h1>
<p>2018-03-27 · <a href="http://dontloo.github.io/blog/lda">http://dontloo.github.io/blog/lda</a></p>
<h3 id="what-does-lda-do">What does LDA do?</h3>
<p>The LDA model converts a bag-of-words document into a (sparse) vector, where each dimension corresponds to a topic, and topics are learned to capture the statistical relations of words. Here’s a nice illustration from <a href="https://www.coursera.org/learn/bayesian-methods-in-machine-learning/home/welcome">Bayesian Methods for Machine Learning
by National Research University Higher School of Economics</a>.
<img src="https://raw.githubusercontent.com/dontloo/dontloo.github.io/master/images/lda1.png" alt="lda" /></p>
<h3 id="lda-vs-vector-space-model">LDA vs. vector space model</h3>
<p>The vector space model representation of a document is a normalized vector in the “word” space,
while the LDA representation is a normalized vector in the “topic” space.
By converting a bag-of-words document from a “word” space into a “topic” space, we can incorporate word correlations learned under the topics.</p>
<h3 id="history">History</h3>
<p>The <a href="http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">LDA paper</a> gives a good summary of the history of text modeling.<br />
<strong>Unigram model</strong><br />
\[p(\mathbf{w})=\prod_{n=1}^Np(w_n) \]<br />
<strong>Mixture of unigrams</strong> (all words from one topic)<br />
\[p(\mathbf{w})=\sum_zp(z)\prod_{n=1}^Np(w_n|z) \]<br />
<strong>Probabilistic latent semantic indexing</strong> (each word from one topic)<br />
\[p(d,w_n)=p(d)\sum_zp(w_n|z)p(z|d)\]<br />
<strong>Latent Dirichlet allocation</strong> (latent Dirichlet prior)<br />
\[p(\mathbf{w}|\alpha,\beta)=\int p(\theta|\alpha)\prod_{n=1}^N\sum_{z_n}p(w_n|z_n,\beta)p(z_n|\theta) d\theta \]</p>
<h3 id="dirichlet-distribution">Dirichlet Distribution</h3>
<p>The multinomial distribution takes the form
\[Mult(\mathbf{n}|\mathbf{p}, N)={N\choose \mathbf{n}}\prod_{k=1}^Kp_k^{n_k}\]<br />
it is a distribution over the exponents \(n_k\). The Dirichlet distribution has a similar form, only it is a distribution over the bases \(p_k\)
\[Dir(\mathbf{p}|\mathbf{\alpha})=\frac{1}{\mathbf{B}(\mathbf{\alpha})}\prod_{k=1}^Kp_k^{\alpha_k-1}\]<br />
The Dirichlet distribution is <strong>sparse</strong> and <strong>multimodal</strong> when \(\alpha_k<1\) as illustrated <a href="https://cs.stanford.edu/~ppasupat/a9online/1080.html">here</a>.</p>
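<p>This sparsity is easy to see by simulation; a small Python sketch (my own) using the standard construction of a Dirichlet sample as normalized independent Gamma draws:</p>

```python
import random

def dirichlet_sample(alpha, rng):
    """Draw from Dirichlet(alpha) by normalizing independent Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

rng = random.Random(0)
sparse = [max(dirichlet_sample([0.1] * 5, rng)) for _ in range(1000)]
dense = [max(dirichlet_sample([10.0] * 5, rng)) for _ in range(1000)]
# with alpha = 0.1 most of the mass piles onto one component
# (the largest component is typically close to 1), while with
# alpha = 10 the draws stay close to uniform (max near 1/5)
```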
<h3 id="latent-dirichlet">Latent Dirichlet</h3>
<p>A document is a multinomial over topics, and a topic is a multinomial over words. The distribution of words given a document is \(p(w|\theta,\beta) = \sum_zp(z|\theta)p(w|z,\beta)=Mult(w|\mu)\), where \(\mu=f_\beta(\theta)=\theta\beta\) is a vector in the “word” space.<br />
<img src="https://raw.githubusercontent.com/dontloo/dontloo.github.io/master/images/lda2.png" alt="lda" /></p>
<p>As shown in the LDA paper, the distribution of \(\mu\) is a continuous mixture with a Dirichlet prior \(p(\theta|\alpha)\) as mixture weights, hence the pdf exhibits a multimodal structure when \(\alpha<1\). It can also be obtained by applying a change of variables to the prior pdf
\[p(\mu|\alpha,\beta) = p(f_\beta(\theta)|\alpha).\]
When \(\mu\) has more dimensions than \(\theta\), it actually lives on a subspace of the simplex, whose basis vectors are the rows of \(\beta\) (the topics).</p>
<h3 id="training-and-inference">Training and Inference</h3>
<p><strong>Variational inference + expectation-maximization</strong><br />
In the LDA model described above, \(\theta\) and \(z\) are latent variables, \(\alpha\) and \(\beta\) are parameters.</p>
<p>The paper
<a href="http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf">David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. 2003.</a> shows how to use expectation-maximization (EM) for a maximum likelihood estimate (MLE) of the parameters \(\alpha\) and \(\beta\). The main problem is that the posterior of the latent variables \(p(\theta,\mathbf{z}|\mathbf{w},\alpha,\beta)\) is intractable to compute; a nice approximation based on variational inference is given in the paper.</p>
<p>Since MLE looks for high likelihood, there’s an inclination towards sparse \(\alpha\) and \(\beta\).</p>
<p><strong>Collapsed Gibbs sampling</strong><br />
A common extension of the original LDA model is to model \(\beta\) as a latent variable with another Dirichlet distribution \(p(\eta)\) as its prior. Then by treating \(\alpha\) and \(\eta\) as hyper-parameters, this paper <a href="http://www.pnas.org/content/101/suppl_1/5228.short">Thomas L Griffiths and Mark Steyvers. Finding scientific topics. 2004.</a> uses collapsed Gibbs sampling to sample directly from the posterior \(p(\mathbf{z}|\mathbf{w})\) instead of doing an optimization.</p>
<p>The training procedure (convergence/burn-in of Markov chains) can be shown as minimizing the KL divergence between the initial distribution and the equilibrium via the transition (see Section 4.4 of <a href="http://coltech.vnu.edu.vn/~thainp/books/Wiley_-_2006_-_Elements_of_Information_Theory_2nd_Ed.pdf">Elements of Information Theory (Second Edition)</a> and <a href="https://www.cs.cmu.edu/~rsalakhu/papers/mckl.pdf">this tutorial</a>)
\[z_i^{(t+1)}\sim p(z_i|\mathbf{z_{\backslash i}^{(t)}}).\]
In the collapsed Gibbs setting, it turns out the only information necessary for sampling from \(p(\mathbf{z}|\mathbf{w})\) are the document–topic counts \(n_m\) and the topic–term counts \(n_k\), which allows the algorithm to be implemented efficiently.</p>
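<p>A compact Python sketch of such a sampler (a toy implementation for illustration, not the authors’ code; <code>n_mk</code> and <code>n_kw</code> are the document–topic and topic–term counts above):</p>

```python
import random

def lda_gibbs(docs, K, V, alpha, eta, iters, seed=0):
    """Collapsed Gibbs sampling for LDA: resample each topic assignment
    z_i from p(z_i=k | z_-i, w) ~ (n_mk + alpha)(n_kw + eta)/(n_k + V*eta),
    where the counts exclude the current token."""
    rng = random.Random(seed)
    z = [[rng.randrange(K) for _ in doc] for doc in docs]
    n_mk = [[0] * K for _ in docs]       # document-topic counts
    n_kw = [[0] * V for _ in range(K)]   # topic-term counts
    n_k = [0] * K                        # topic totals
    for m, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[m][i]
            n_mk[m][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[m][i]  # remove the current token from the counts
                n_mk[m][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                weights = [(n_mk[m][t] + alpha) * (n_kw[t][w] + eta)
                           / (n_k[t] + V * eta) for t in range(K)]
                r = rng.random() * sum(weights)  # categorical draw
                k, acc = 0, weights[0]
                while r > acc:
                    k += 1
                    acc += weights[k]
                z[m][i] = k  # add the token back under the sampled topic
                n_mk[m][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return z, n_mk, n_kw
```

<p>Note that only the two count tables are touched per update, which is why the collapsed sampler can be implemented so efficiently.</p>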
<p>Based on a full set of samples \(\mathbf{z}\), the distribution of \(\beta\) and \(\theta\) can be estimated by
\[p(\theta|\mathbf{w},\mathbf{z},\alpha) = Dir(\theta|n_m+\alpha)\]
\[p(\beta|\mathbf{w},\mathbf{z},\eta) = Dir(\beta|n_k+\eta)\]
For inference, collapsed Gibbs sampling can also be used to sample the latent variable \(\tilde{\mathbf{z}}\)<br />
\[p(\tilde{\mathbf{z}}|\tilde{\mathbf{w}},\mathbf{z},\mathbf{w})\]
see <a href="http://www.arbylon.net/publications/text-est.pdf">Parameter estimation for text analysis</a> for details.</p>
<h3 id="others">Others</h3>
<p><a href="https://stats.stackexchange.com/questions/337193/similarity-between-documents-in-lda-word-vectors-space">Similarity between documents in LDA “word” vectors space?</a>
<a href="https://cs.stanford.edu/~ppasupat/a9online/1140.html">LSA / PLSA / LDA</a></p>
<h1>Expectation Maximization Sketch</h1>
<p>2018-03-06 · <a href="http://dontloo.github.io/blog/em">http://dontloo.github.io/blog/em</a></p>
<h3 id="intro">Intro</h3>
<p>Say we have observed data \(X\), latent variables \(Z\) and parameters \(\theta\), and we want to maximize the log-likelihood \(\log p(X|\theta)\). Sometimes this is not an easy task, perhaps because it doesn’t have a closed-form solution, the gradient is difficult to compute, or there are complicated constraints that \(\theta\) must satisfy.</p>
<p>If somehow the joint log-likelihood \(\log p(X, Z|\theta)\) can be maximized more easily, we can turn to the Expectation Maximization algorithm for help. There are several ways to formulate the EM algorithm, as will be discussed in this blog.</p>
<h3 id="joint-log-likelihood">Joint Log-likelihood</h3>
<p>The basic idea is just to optimize the joint log-likelihood \(\log p(X, Z|\theta)\) instead of the data log-likelihood \(\log p(X|\theta)\). But since the true values of latent variables \(Z\) are unknown, we need to estimate a posterior distribution \(p(z|x, \theta)\) for each data point \(x\), then <strong>maximize</strong> the <strong>expected</strong> log-likelihood over the posterior<br />
\[\sum_x E_{p_{z|x}}[\log p(x,z|\theta)].\]</p>
<p>The optimization follows two iterative steps. The E-step computes the expectation under the current parameter \(\theta’\)
\[\sum_x \sum_{z} p(z|x, \theta’) \log p(x,z|\theta) = Q(\theta|\theta’).\]
The M-step then finds the new parameter \(\theta\) that maximizes \(Q(\theta|\theta’)\). It turns out that this procedure is guaranteed to find a local maximum of the data log-likelihood \(\log p(X|\theta)\), as will be shown in later sections.</p>
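<p>As a concrete instance (a hedged Python sketch of the classic “two coins” example from Do and Batzoglou’s EM primer, not from this post; it assumes the coin for each row is chosen with equal, fixed probability): each row of flips was produced by one of two coins with unknown heads-probabilities, and which coin produced it is the latent \(z\).</p>

```python
def em_two_coins(flips, theta=(0.6, 0.5), iters=20):
    """EM for two coins with unknown heads-probabilities; each (heads,
    tails) row was generated by one of the coins (the latent z)."""
    tA, tB = theta
    for _ in range(iters):
        # E-step: responsibility w = p(z = A | row, current theta)
        hA = hB = nA = nB = 0.0  # expected heads / totals per coin
        for h, t in flips:
            lA = tA ** h * (1 - tA) ** t  # likelihood under coin A
            lB = tB ** h * (1 - tB) ** t  # likelihood under coin B
            w = lA / (lA + lB)
            hA += w * h
            nA += w * (h + t)
            hB += (1 - w) * h
            nB += (1 - w) * (h + t)
        # M-step: maximize Q(theta | theta') in closed form
        tA, tB = hA / nA, hB / nB
    return tA, tB
```

<p>On the primer’s data set of five 10-flip rows, the estimates settle near \((0.80, 0.52)\): the joint log-likelihood is easy to maximize once the responsibilities are fixed, which is exactly the point of the E/M split.</p>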
<h3 id="evidence-lower-bound-elbo">Evidence Lower Bound (ELBO)</h3>
<p>One way to derive EM formally is via constructing the evidence lower bound of \(\log p(X|\theta)\) using Jensen’s inequality
\[\log p(X|\theta) = \sum_x \log p(x|\theta)\]
\[= \sum_x \log \sum_z p(x, z|\theta)\]
\[= \sum_x \log \sum_z q_{z|x}(z) \frac{p(x, z|\theta)}{q_{z|x}(z)}\]
\[\geq \sum_x \sum_z q_{z|x}(z) \log \frac{p(x, z|\theta)}{q_{z|x}(z)}\]
where \(q_{z|x}(z)\) is an arbitrary distribution over the latent variable associated with data point \(x\).</p>
<p>At the E-step, we keep \(\theta\) fixed and find the \(q\) that makes the <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality#Information_theory">equality hold</a>. Since \(q\) has to satisfy the properties of being a probability distribution, the problem becomes, for each data point \(x\),
\[\max_{q_{z|x}(z)} \sum_z q_{z|x}(z) \log \frac{p(x, z|\theta)}{q_{z|x}(z)}\]<br />
s.t.<br />
\[q_{z|x}(z)\geq 0, \sum_z q_{z|x}(z) = 1.\]
As we know from the previous section, the solution should be \(q_{z|x}(z) = p(z|x, \theta)\). Specifically, if \(z\) is a discrete variable, the problem can be solved using Lagrange multipliers; see <a href="https://www.ics.uci.edu/~smyth/courses/cs274/readings/domke_notes_on_EM.pdf">this tutorial by Justin Domke</a> (my teacher ;).</p>
<p>At the M-step we maximize over \(\theta\) while keeping \(q_{z|x}\) fixed.
\[\sum_x \sum_z q_{z|x}(z) \log \frac{p(x, z|\theta)}{q_{z|x}(z)} \]
\[= \sum_x \sum_z q_{z|x}(z) \log p(x, z|\theta) - \sum_x \sum_z q_{z|x}(z) \log q_{z|x}(z) \]
\[= Q(\theta|\theta’) + \sum_x H(q_{z|x}) \]
The second term \(H(q_{z|x})\) is independent of \(\theta\) given \(q_{z|x}\) is fixed, so we only need to optimize \(Q(\theta|\theta’)\) which is in line with the previous formulation.</p>
<p>So the M-step maximizes the lower bound w.r.t \(\theta\), and the E-step sets a new lower bound based on the current value of \(\theta\).</p>
<h3 id="latent-distribution">Latent Distribution</h3>
<p>Let’s now see how to decompose the lower bound from the data likelihood without using Jensen’s inequality. For simplicity only the derivation for one data point \(x\) is given here.
\[\log p(x|\theta) = \sum_z q_{z|x}(z) \log p(x|\theta) \]
\[= \sum_z q_{z|x}(z) \log \frac{p(x,z|\theta)}{p(z|x,\theta)} \]
\[= \sum_z q_{z|x}(z) \log \frac{p(x,z|\theta)q_{z|x}(z)}{p(z|x,\theta)q_{z|x}(z)} \]
\[= \sum_z q_{z|x}(z) \log \frac{p(x,z|\theta)}{q_{z|x}(z)} - \sum_z q_{z|x}(z) \log \frac{p(z|x,\theta)}{q_{z|x}(z)}\]
\[= F(q_{z|x}, \theta) + D_{KL}(q_{z|x} | p_{z|x})\]
Here \(F(q_{z|x}, \theta)\) is the evidence lower bound and the remaining term is the KL divergence between the latent distribution \(q_{z|x}(z)\) and the posterior \(p(z|x,\theta)\).</p>
<p>We’ve formalized the lower bound as a function (functional) of two parameters, EM essentially does the optimization via coordinate ascent.</p>
<p>In the E-step we optimize \(F(q_{z|x}, \theta)\) w.r.t \(q_{z|x}\) while holding \(\theta\) fixed. Since \(\log p(x|\theta)\) does not depend on \(q_{z|x}\), the largest value of \(F(q_{z|x}, \theta)\) occurs when \(D_{KL}(q_{z|x} | p_{z|x})=0\), we have again \(q_{z|x}(z) = p(z|x,\theta)\). In the M-step \(F(q_{z|x}, \theta)\) is maximized w.r.t \(\theta\), which is the same as the above section.</p>
<h3 id="kl-divergence">KL Divergence</h3>
<p>It turns out the lower bound \(F(q_{z|x}, \theta)\) above can also be written in terms of a KL divergence. If we let \(q(x,z) = q(z|x)p(x)\), where \(q(z|x) = q_{z|x}(z)\) and \(p(x)=\frac{1}{|X|}\sum_{i}\delta_i(x)\) is the empirical distribution that places all its mass on the observed data \(X\), we have
\[\sum_x p(x)f(x) = \frac{1}{|X|}\sum_{x \in X} f(x). \]
Then the lower bound can be rewritten as
\[\sum_x F(q_{z|x}, \theta) = \sum_x \sum_z q_{z|x}(z) \log \frac{p(x,z|\theta)}{q_{z|x}(z)}\]
\[= |X| \sum_x \sum_z q(x,z) \log \frac{p(x,z|\theta)}{|X|\,q(x,z)} \]
\[= -|X| D_{KL}(q_{x,z} | p_{x,z}) - |X|\log|X|. \]
Since the \(|X|\log|X|\) term is a constant, the M-step (maximizing the lower bound w.r.t \(\theta\)) is minimizing \(D_{KL}(q_{x,z} | p_{x,z})\) w.r.t \(p_{x,z}\).</p>
<p>Similarly for the \(D_{KL}(q_{z|x} | p_{z|x})\) term, substituting \(p(z|x,\theta) = p(x,z|\theta)/p(x|\theta)\) and \(q_{z|x}(z) = |X|q(x,z)\) gives
\[\sum_x D_{KL}(q_{z|x} | p_{z|x}) = - \sum_x \sum_z q_{z|x}(z) \log \frac{p(z|x,\theta)}{ q_{z|x}(z) }\]
\[= - |X| \sum_x \sum_z q(x,z) \log \frac{p(x, z|\theta)}{|X|\,q(x,z)\,p(x|\theta)}\]
\[= |X| D_{KL}(q_{x,z} | p_{x,z}) + |X|\log|X| + \sum_x \log p(x|\theta). \]
The last two terms do not depend on \(q_{z|x}\), so the E-step is minimizing the same KL divergence \(D_{KL}(q_{x,z} | p_{x,z})\), but w.r.t \(q\).</p>
<p>Since \(q(x,z)\) follows the restriction that it must align with the data, and \(p(x,z|\theta)\) must be a distribution under the specified model, they can be thought of as living on two manifolds in the space of all distributions, namely the data manifold and the model manifold. Therefore EM can be viewed as minimizing the distance between the two manifolds \(D_{KL}(q_{x,z} | p_{x,z})\) via <a href="https://en.wikipedia.org/wiki/Coordinate_descent"><strong>coordinate descent</strong></a>.</p>
<p>For more about the geometric view of EM, please refer to <a href="http://mi.eng.cam.ac.uk/~wjb31/PUBS/igmlc.ciss96.pdf">this paper</a>, and also <a href="https://stats.stackexchange.com/questions/335674/how-to-picture-em-algorithm-and-kl-divergence-geometrically">this question on SE</a>.</p>
<h3 id="log-sum-to-sum-log">Log-sum to Sum-log</h3>
<p>Despite these different views of EM, the practical advantage of EM lies in Jensen’s inequality, which moves the logarithm inside the summation
\[\sum_x \log \sum_z q_{z|x}(z) \frac{p(x, z|\theta)}{q_{z|x}(z)} \geq \sum_x \sum_z q_{z|x}(z) \log \frac{p(x, z|\theta)}{q_{z|x}(z)}. \]
If the joint distribution \(p(x, z|\theta)\) belongs to the exponential family, it turns a log-sum-exp operation into a weighted summation of the exponents (often sufficient statistics), which could be easier to optimize.</p>
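A quick numerical sanity check of the inequality, with an arbitrary toy joint \(p(x,z|\theta)\) over three values of \(z\) and an arbitrary \(q_{z|x}\) (all numbers are made up for illustration):

```python
import math

# arbitrary toy values of p(x, z|theta) for z = 0, 1, 2
p_xz = [0.1, 0.25, 0.05]
# an arbitrary latent distribution q(z|x)
q = [0.2, 0.5, 0.3]

# left-hand side: the log of a sum, i.e. log p(x|theta)
log_sum = math.log(sum(p_xz))

# right-hand side: Jensen's lower bound, a sum of logs weighted by q
sum_log = sum(qi * math.log(pi / qi) for qi, pi in zip(q, p_xz))

assert log_sum >= sum_log  # the bound holds for any q
```

The bound becomes tight exactly when \(q_{z|x}\) equals the posterior \(p(z|x,\theta)\), as derived in the sections above.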
<h3 id="alternatives-for-e-step-and-m-step">Alternatives for E-step and M-step</h3>
<p>Sometimes we aren’t able to reach the optimal solution to the E-step or the M-step, perhaps because of difficulties in computation or optimization, trade-offs between simplicity and accuracy, or other restrictions on distributions or parameters. In these cases, we can use alternative approaches for a suboptimal solution.</p>
<p>For example K-means is a special case of EM for GMMs, where the latent distribution is restricted to be a delta function (hard assignment).</p>
<p>In LDA, a prior distribution is added to the parameter, which makes the parameter another latent variable, and the posterior of the latent variables becomes difficult to compute. So variational methods are used for approximation: specifically, the latent distribution \(q\) is characterized by a variational model with parameter \(\psi\). Then in the E-step we optimize \(q\) w.r.t \(\psi\) and in the M-step we optimize \(p\) w.r.t \(\theta\). For parameters that cannot be solved in closed form, gradient-based optimization is applied.</p>Intro Say we have observed data \(X\), the latent variable \(Z\) and parameter \(\theta\), we want to maximize the log-likelihood \(\log p(X|\theta)\). Sometimes it’s not an easy task, probably because it doesn’t have a closed-form solution, the gradient is difficult to compute, or there’re complicated constraints that \(\theta\) must satisfy.Central Limit Theorem and Stuff2017-10-20T00:00:00+00:002017-10-20T00:00:00+00:00http://dontloo.github.io/blog/CLT<p>This post is about some understandings of the Central Limit Theorem and (un)related stuff.</p>
<h2 id="central-limit-theorem">Central Limit Theorem</h2>
<p>As defined in <a href="https://en.wikipedia.org/wiki/Central_limit_theorem">Wikipedia</a>, its basic form is
\[ \sqrt{n}(S_n-\mu) \rightarrow N(0,\sigma^2)\]
saying that as \(n\) approaches infinity, the random variable \( \sqrt{n}(S_n-\mu) \), where \(S_n\) is the sample mean, converges in distribution to a normal \( N(0,\sigma^2) \).</p>
<p>The interesting part is the scaling factor \( \sqrt{n} \):
with a smaller scaling factor the scaled term always goes to zero,
with a larger one it blows up; only with \( \sqrt{n} \) does it converge to a distribution with constant variance.</p>
<p>It can be proved using the fact that convergence of <a href="https://en.wikipedia.org/wiki/Moment-generating_function">moment-generating functions (MGFs)</a> implies convergence in distribution,
i.e. by showing \( M_{\sqrt{n}(S_n-\mu)/\sigma}(t) \rightarrow M_{N(0,1)}(t)\) as \(n \rightarrow \infty\).
Here’s a very good lecture on that <a href="https://youtu.be/OprNqnHsVIA">Lecture 29: Law of Large Numbers and Central Limit Theorem | Statistics 110</a>.</p>
<p>So we know for \(n\) i.i.d variables with zero mean, as \(n\) gets larger, its mean goes to zero (the law of large numbers), its sum blows up to positive/negative infinity, but this term \( \sqrt{n}(S_n-\mu) \) converges in distribution to a normal.</p>
<p>The CLT is useful in many areas including <a href="https://en.wikipedia.org/wiki/Wiener_process#Wiener_process_as_a_limit_of_random_walk">the Wiener process</a>,
which is often used to model stock prices, see <a href="http://epchan.blogspot.jp/2016/04/mean-reversion-momentum-and-volatility.html">this blog</a> and <a href="https://stats.stackexchange.com/q/308545/95569">this question</a>.</p>
<h2 id="variance-of-sum-of-iid-random-variables">Variance of Sum of IID Random Variables</h2>
<p>The \( \sqrt{n} \) scaling factor also applies when summing a finite number of i.i.d random variables (ref: <a href="https://en.wikipedia.org/wiki/Variance#Sum_of_uncorrelated_variables_(Bienaym%C3%A9_formula)">sum of uncorrelated variables</a>). For example <a href="https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables">sum of normally distributed random variables</a>, and <a href="https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution">Irwin–Hall distribution</a>.</p>
<p>Note that this is different from applying a change of variable \(z=2x\), which gives a variance of \((2\sigma)^2\) instead of \(2\sigma^2\).</p>
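The distinction is easy to check numerically; a quick sketch (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(100_000, 2))  # pairs of i.i.d. N(0, 1) draws

# sum of two i.i.d. variables: Var(x1 + x2) = 2 * sigma^2
var_sum = (x[:, 0] + x[:, 1]).var()

# change of variable z = 2x: Var(2x) = (2 * sigma)^2 = 4 * sigma^2
var_scaled = (2 * x[:, 0]).var()
```

With \(\sigma^2 = 1\), `var_sum` comes out near 2 while `var_scaled` comes out near 4.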
<p>For \(n\) normally distributed random variables with the same variance, it follows
\(\sum_i (x_i-\mu_i) \sim N(0, n\sigma^2)\), which is equivalent to
\[ \sqrt{n}(S_n-\mu) \sim N(0,\sigma^2)\]
the Central Limit Theorem.</p>
<h2 id="sampling-normal-distributions">Sampling Normal Distributions</h2>
<p>Using the facts above we can approximate samples from any normal distribution given samples from a uniform distribution \(U(0,1)\) (though the Box–Muller transform can do this exactly).</p>
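For instance, a classic version of this trick sums 12 draws from \(U(0,1)\) and subtracts 6: the result has mean \(12 \times 1/2 - 6 = 0\) and variance \(12 \times 1/12 = 1\), approximating a standard normal sample. A minimal sketch:

```python
import numpy as np

def approx_standard_normal(n, rng):
    """Approximate N(0, 1) samples as a sum of 12 U(0, 1) draws minus 6."""
    # mean: 12 * 1/2 - 6 = 0, variance: 12 * 1/12 = 1
    return rng.uniform(0.0, 1.0, size=(n, 12)).sum(axis=1) - 6.0

rng = np.random.default_rng(0)
samples = approx_standard_normal(100_000, rng)  # roughly standard normal
```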
<p>According to Central Limit Theorem, the sum of \(n\) uniformly distributed variables approximately follows a normal distribution, \(\sum x \sim N(n/2,\sigma^2)\). According to <a href="https://en.wikipedia.org/wiki/Variance#Sum_of_uncorrelated_variables_(Bienaym%C3%A9_formula)">sum of uncorrelated variables</a>, we can work out the variance \(\sigma^2 = nVar(x)=n/12\), therefore \(\sqrt{n}(\sum x/n-1/2)\) converges in distribution to \( N(0,1/12) \).</p>This post is about some understandings of the Central Limit Theorem and (un)related stuff.Cortana Skills Walkthrough2017-08-23T00:00:00+00:002017-08-23T00:00:00+00:00http://dontloo.github.io/blog/cortana-skills<p>At the time of writing, <a href="https://developer.microsoft.com/en-us/cortana">Cortana</a> supports two ways of creating skills as shown <a href="https://developer.microsoft.com/en-us/cortana/dashboard#!/home">here</a>.</p>
<p>The first option (based on Dorado) is still in its alpha version, you’ll need to send your Microsoft account (MSA) to <code class="highlighter-rouge">mksadmins@microsoft.com</code> to get approved. It provides a unified platform for building and managing Cortana skills, but less control on implementation details.</p>
<p>The second option (based on <a href="https://dev.botframework.com/">Bot Framework</a>) is publicly available, together with its <a href="https://channel9.msdn.com/Events/Build/2017/B8031">introduction video</a>.
However in practice there could be many pitfalls for someone starting from scratch. This walkthrough will guide you step by step to create a simple Cortana skill.</p>
<p>As described in the <a href="https://docs.microsoft.com/en-us/cortana/tutorials/bot-skills/creating-a-bot-based-skill">official document</a>, the major steps are as follows.</p>
<blockquote>
<ol>
<li>Build or reuse an existing bot using the latest BotBuilder SDK.</li>
<li>Use LUIS.ai in your bot if you need natural language understanding capabilities in your bot.</li>
<li>Use the new speech functionalities in the BotBuilder SDK to give your bot a voice.</li>
<li>Deploy your bot to Azure.</li>
<li>Register your bot with the Bot Framework.</li>
<li>Add your bot to the Cortana Channel.</li>
<li>Publish your Cortana skill.</li>
</ol>
</blockquote>
<p>We’ll walk through each step in details.</p>
<p><strong>0. Set up your Cortana skills development environment.</strong></p>
<p>The first thing to do is make sure you have access to all the resources listed in <a href="https://docs.microsoft.com/en-us/cortana/tutorials/setup-dev-env">this article</a>. Note that the Microsoft account (MSA) required for Cortana and LUIS registration cannot be the <code class="highlighter-rouge">yourname@microsoft.com</code> account for Microsoft employees.</p>
<p><strong>1. Build or reuse an existing bot using the latest BotBuilder SDK.</strong></p>
<p><a href="https://docs.microsoft.com/en-us/bot-framework/resources-tools-downloads">Here</a> is how to get the SDK and other tools you might need.
Following the instructions <a href="https://docs.microsoft.com/en-us/bot-framework/bot-builder-overview-getstarted">here</a> you will be able to develop an echo bot on the platform you prefer.</p>
<p><strong>2. Use LUIS.ai in your bot if you need natural language understanding capabilities in your bot.</strong></p>
<p><a href="https://www.luis.ai">LUIS.ai</a> is a language understanding framework that can work with Bot Framework.
Following <a href="https://docs.microsoft.com/en-us/azure/cognitive-services/luis/luis-get-started-create-app">this</a> tutorial you will be able to create a LUIS application.
It’s not necessary to add an intent or entity to make your LUIS model work; if LUIS cannot recognize an intent, it will return an empty string for your bot to handle, see <a href="https://stackoverflow.com/q/41392366/3041068">this question</a>.</p>
<p>After you’ve successfully created an application, you can see the <a href="https://www.luis.ai/applications">application ID</a> and <a href="https://www.luis.ai/keys">keys</a>.
By adding this information to the bot you created in step 1 like <a href="https://github.com/Microsoft/BotBuilder-Samples/blob/master/CSharp/intelligence-LUIS/Dialogs/RootLuisDialog.cs#L14">this</a>, your bot will be able to exploit LUIS features.
<a href="https://github.com/Microsoft/BotBuilder-Samples/tree/master/CSharp/intelligence-LUIS">Here</a> is a more sophisticated LUIS example.</p>
<p><strong>3. Use the new speech functionalities in the BotBuilder SDK to give your bot a voice.</strong></p>
<p>See <a href="https://docs.microsoft.com/en-us/bot-framework/dotnet/bot-builder-dotnet-cortana-skill">this tutorial</a>. In line with the above-mentioned LUIS example, we can also use <code class="highlighter-rouge">context.SayAsync(text, speech)</code> instead of <code class="highlighter-rouge">context.PostAsync(text)</code> to add speech to your bot.</p>
<p><strong>4. & 5. Deploy your bot to Azure. Register your bot with the Bot Framework.</strong></p>
<p>See <a href="https://docs.microsoft.com/en-us/bot-framework/deploy-bot-overview">here</a> and <a href="https://docs.microsoft.com/en-us/bot-framework/portal-register-bot">here</a> on how to deploy and register your bot; after registration, remember to configure the <code class="highlighter-rouge">web.config</code> file as follows.</p>
<blockquote>
<p>If you’re using the Bot Builder SDK for Node.js, set the following environment variables:</p>
<ul>
<li>MICROSOFT_APP_ID</li>
<li>MICROSOFT_APP_PASSWORD</li>
</ul>
<p>If you’re using the Bot Builder SDK for .NET, set the following key values in the web.config file:</p>
<ul>
<li>MicrosoftAppId</li>
<li>MicrosoftAppPassword</li>
</ul>
</blockquote>
<p><strong>6. & 7. Add your bot to the Cortana Channel. Publish your Cortana skill.</strong></p>
<p>Steps 6 and 7 are trivial, see <a href="https://docs.microsoft.com/en-us/cortana/tutorials/bot-skills/add-bot-to-cortana-channel">this tutorial</a>. Beware of these <a href="https://docs.microsoft.com/en-us/cortana/testing/known-issues">known issues</a>, in particular the “LuisDialog fails on skill Launch” issue described in <a href="https://stackoverflow.com/q/45860583/3041068">this question</a>.</p>
<p>Now you’ll be able to try out the Cortana skill you’ve just built, make sure the Cortana on your system is logged in with the same account you used for development.</p>
<p>Have fun!</p>At the time of writing, Cortana supports two ways of creating skills as shown here.Softmax2017-06-18T00:00:00+00:002017-06-18T00:00:00+00:00http://dontloo.github.io/blog/softmax<h3 id="normalization-functions">Normalization functions</h3>
<p>There are many ways of doing normalization, in this article we focus on the following type
\[y=f(x) \text{ s.t. } y_i\geq 0, \sum y_i=1. \]
Such normalization functions are widely used in the machine learning field, for example, to represent discrete probability distributions.</p>
<h3 id="how-to-construct-a-normalization-function">How to construct a normalization function</h3>
<p>A straightforward approach is</p>
<ol>
<li>Map input from real numbers to non-negative numbers (to ensure \(y_i\geq 0\))</li>
<li>Divide by the sum (to ensure \(\sum y_i=1\))</li>
</ol>
<h3 id="softmax">Softmax</h3>
<p>Softmax is a good example of this type</p>
<ol>
<li>The exponentiation \(e^x\) maps \(x\) from real numbers to positive numbers</li>
<li>Divide by \(\sum e^{x_i}\)</li>
</ol>
<p>Then we have Softmax \[y_i = \frac{e^{x_i}}{\sum e^{x_j}}\]</p>
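A direct implementation of the two steps (a sketch in NumPy); subtracting the maximum before exponentiating does not change the result, since the shift cancels in the ratio, but avoids overflow:

```python
import numpy as np

def softmax(x):
    """Softmax along the last axis, with max-subtraction for stability."""
    x = np.asarray(x, dtype=float)
    # step 1: map to positive numbers (shifting by the max cancels in step 2)
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    # step 2: divide by the sum
    return e / e.sum(axis=-1, keepdims=True)

softmax([-2.0, -1.0])  # ~ [0.269, 0.731]
```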
<h3 id="why-the-name-softmax">Why the name Softmax</h3>
<p>Suppose the max function returns a one-hot vector with one at the index of maximum value and zero elsewhere, for example
\[x=[-2,-1], \quad max(x)=[0,1].\]
Softmax will return a “softened” version where the index of maximum value still has the largest value but is smaller than one<br />
\[softmax(x)=[0.269, 0.731].\]</p>
<h3 id="why-softmax">Why Softmax</h3>
<p>People use Softmax because it is the canonical link function for the cross entropy loss, which basically means the derivative for the duo is nice, ref: <a href="https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function">Derivative of Softmax loss function</a></p>
<p>\[L = -\sum t_i\log y_i = -\sum t_i\log\frac{e^{x_i}}{\sum e^{x_j}} = -\sum t_i (x_i - \log\sum e^{x_j})\]
\[\frac{\partial L}{\partial x_i} = y_i - t_i.\]</p>
<p>For better numerical stability, the operation \(\log\sum e^{x}\) can be implemented using the <a href="https://en.wikipedia.org/wiki/LogSumExp">LogSumExp trick</a>. Otherwise we’ll often need to clip its value to avoid numerical under/overflow, which would lead to inaccurate derivatives.</p>
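A minimal sketch of the trick: factor the maximum out before exponentiating, so inputs that would overflow a naive \(\log\sum e^{x}\) still give the correct finite result.

```python
import numpy as np

def logsumexp(x):
    """Compute log(sum(exp(x))) stably by factoring out max(x)."""
    x = np.asarray(x, dtype=float)
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

x = np.array([1000.0, 1001.0])
# a naive np.log(np.exp(x).sum()) overflows to inf here, but this stays finite
result = logsumexp(x)  # ~ 1001.3133
```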
<p>Therefore in practice, some libraries implement Softmax as part of the loss function for the aforementioned benefits in forward and backward passes. The operation of computing \(x_i - \log\sum e^{x_i}\) is often called <em>LogSoftMax</em>.</p>Normalization functions There are many ways of doing normalization, in this article we focus on the following type \[y=f(x) \text{ s.t. } y_i\geq 0, \sum y_i=1. \] Such normalization functions are widely used in the machine learning field, for example, to represent discrete probability distributions.Model the Joint Likelihood?2017-05-31T00:00:00+00:002017-05-31T00:00:00+00:00http://dontloo.github.io/blog/model-joint-likelihood<p>When building a classifier, we often optimize the conditional log-likelihood \(\log p_\theta(t|x)=\log Multi(t|y(x), n=1) = \sum t_k\log y_k \) w.r.t some parameter \(\theta\), where \(x\) is the input, \(t\) is the target and \(y\) is the output of the classifier (network) which is guaranteed to be nonnegative and \(\sum y_k=1\) via normalization (e.g. softmax).</p>
<p>Mostly the normalization does something like \(y_k = \hat{y_k}/\sum \hat{y_j}\), following the multinomial model we can interpret each \(y_k\) as \(p_\theta(t_k|x)\), we could further interpret \(\hat{y_k}\) (nonnegative) as \(p_\theta(t_k,x)\), then it follows \(p_\theta(t_k|x) = \hat{y_k}/\sum \hat{y_j} = p_\theta(t_k,x)/p_\theta(x) \), which is just the Bayes rule.</p>
<p>So we have the joint probability defined, it seems we can directly read off \(\sum \hat{y_j}\) as the marginal likelihood \(p_\theta(x)\). However the problem is, \(p_\theta(x)\) could be just any distribution (say very high for one input and very low for another and vice versa), as long as the corresponding \(p_\theta(t_k,x)\) is of the same scale, the loss won’t be so different. Indeed it can be a meaningful distribution if we introduce some assumptions to \(p_\theta(x)\).</p>
<p>For instance, say we let \(p_\theta(x)\) be a Gaussian mixture model, then \(p_\theta(t_k|x)\) is just the “responsibility” of the \(k\)th mixture component. If we optimize \(p_\theta(t_k|x)\) directly, we have the same issue, a data point can be far away from any component while still having the correct conditional likelihood. To obtain a meaningful marginal distribution, we can add \(p_\theta(x)\) to the loss, which will then become the joint log-likelihood \(\log p_\theta(x,t) = \log p_\theta(t|x) + \log p_\theta(x)\).</p>
<p>Then it reminds me of the <a href="https://www.cv-foundation.org/openaccess/content_cvpr_2015/ext/1A_089_ext.pdf">Google FaceNet</a>, which basically minimizes the distance between data points in the same category and maximizes the distance between data points from different categories. This is equivalent to minimizing the variance within each Gaussian mixture component while maximizing the distance between component centers. Thus we can set the loss to be \(\log p_\theta(x|t) - dist(\theta)\). There’re many ways to implement \(dist(\theta)\), one obvious way is to compute the distance between every pair of data and add them together. Or as mentioned in this paper <a href="https://arxiv.org/pdf/1703.03130.pdf">(A Structured Self-attentive Sentence Embedding)</a>, we can use \(-\left||cov(\theta)-I|\right|^2\) to set the covariance matrix of component centers close to the identity matrix. However this will also lead to redundant components when the number of components is larger than the dimension of data. There are other <a href="https://en.wikipedia.org/wiki/Statistical_dispersion">dispersion measures</a> we could try.</p>When building a classifier, we often optimize the conditional log-likelihood \(\log p_\theta(t|x)=\log Multi(t|y(x), n=1) = \sum t_k\log y_k \) w.r.t some parameter \(\theta\), where \(x\) is the input, \(t\) is the target and \(y\) is the output of the classifier (network) which is guaranteed to be nonnegative and \(\sum y_k=1\) via normalization (e.g. softmax). Mostly the normalization does something like \(y_k = \hat{y_k}/\sum \hat{y_j}\), following the multinomial model we can interpret each \(y_k\) as \(p_\theta(t_k|x)\), we could further interpret \(\hat{y_k}\) (nonnegative) as \(p_\theta(t_k,x)\), then it follows \(p_\theta(t_k|x) = \hat{y_k}/\sum \hat{y_j} = p_\theta(t_k,x)/p_\theta(x) \), which is just the Bayes rule. 
So we have the joint probability defined, it seems we can directly read off \(\sum \hat{y_j}\) as the marginal likelihood \(p_\theta(x)\). However the problem is, \(p_\theta(x)\) could be just any distribution (say very high for one input and very low for another and vice versa), as long as the corresponding \(p_\theta(t_k,x)\) is of the same scale, the loss won’t be so different. Indeed it can be a meaningful distribution if we introduce some assumptions to \(p_\theta(x)\).Naive Bayes and Logistic Regression2017-04-25T00:00:00+00:002017-04-25T00:00:00+00:00http://dontloo.github.io/blog/naive-bayes-and-logistic-regression<h3 id="introduction">Introduction</h3>
<p>Naive Bayes and logistic regression are two basic machine learning models that are compared frequently,
especially as the generative/discriminative counterpart of one another.
However at first sight it seems these two methods are rather different:
in naive Bayes we just count the frequencies of features and labels, while in logistic regression we optimize the parameters with regard to some loss function.
If we express these two models as probabilistic graphical models, we’ll see exactly how they are related.</p>
<h3 id="graphical-models">Graphical Models</h3>
<p><img src="https://raw.githubusercontent.com/dontloo/dontloo.github.io/master/images/gen.png" alt="disvsgen" /></p>
<p><strong>naive Bayes</strong><br />
As shown in the figure (borrowed from <a href="http://people.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf">this tutorial</a> without permission),
the naive Bayes model can be expressed as a directed graph where the parent node denotes the output and the leaf nodes denote the input,
the joint probability is then
\[p(x, y)=p(y)\prod_i p(x_i|y).\]</p>
<p><strong>logistic regression</strong><br />
The logistic regression model can be expressed as the conditional counterpart of the naive Bayes model.
\[\tilde{p}(x, y)=\psi(y)\prod_i\psi(x_i, y)\]
\[p(y|x)=\frac{1}{z(x)}\tilde{p}(x, y)\]
\[z(x)=\sum_y\tilde{p}(x, y).\]</p>
<blockquote>
<p>To have the network structure and parameterization correspond naturally to a conditional distribution, we want to avoid representing a probabilistic model over x. We therefore disallow potentials that involve only variables in x. –Probabilistic Graphical Models</p>
</blockquote>
<h3 id="maximum-likelihood-estimation-mle">Maximum Likelihood Estimation (MLE)</h3>
<p>In the simplest case where both the input and output are binary values, for the naive Bayes model,
both \(p(y)\) and \(p(x_n|y)\) can be modeled as Bernoulli distributions
\[p(y)=Ber(y|\theta_0)\]
\[p(x_i|y)=Ber(x_i|\theta_{yi})\]
\[p(y=1|x)=\frac{\theta_0\prod \theta_{1i}^{x_i}(1-\theta_{1i})^{1-x_i}}{\theta_0\prod \theta_{1i}^{x_i}(1-\theta_{1i})^{1-x_i}+(1-\theta_0)\prod \theta_{0i}^{x_i}(1-\theta_{0i})^{1-x_i}}.\]
Then the MLE for \(\theta\) could simply be solved by counting the frequencies.</p>
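To make “counting frequencies” concrete, here is a minimal sketch for the binary case above, with add-one (Laplace) smoothing folded in to avoid zero counts; the toy data are made up for illustration:

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Fit binary naive Bayes by counting, with add-one smoothing.
    Returns theta0 = p(y=1) and theta[c, i] = p(x_i=1 | y=c)."""
    X, y = np.asarray(X), np.asarray(y)
    theta0 = y.mean()
    theta = np.stack([
        (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
        for c in (0, 1)
    ])
    return theta0, theta

def predict_proba(x, theta0, theta):
    """p(y=1|x) via the Bayes rule formula above."""
    x = np.asarray(x)
    lik = (theta ** x * (1 - theta) ** (1 - x)).prod(axis=1)  # p(x|y=0), p(x|y=1)
    joint = np.array([1 - theta0, theta0]) * lik
    return joint[1] / joint.sum()

# toy data (made up): the label equals the first feature
X = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])
y = np.array([1, 1, 0, 0])
theta0, theta = fit_bernoulli_nb(X, y)
```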
<p>For the logistic regression model, we can choose the log-linear representation with feature functions as the followings
\[\psi(x_i, y)=\exp(w_i\phi(x_i, y))=\exp(w_i I(x_i=1, y=1))\]
\[\psi(y)=\exp(w_0\phi(y))=\exp(w_0 y)\]
then it follows
\[\tilde{p}(x, y=1)=\exp(\sum_i w_ix_i+w_0)\]
\[\tilde{p}(x, y=0)=\exp(\sum_i w_i\times 0+w_0\times 0)=1\]
\[p(y=1|x)=\frac{\tilde{p}(x, y=1)}{\tilde{p}(x, y=1)+\tilde{p}(x, y=0)}=\sigma(\sum_i w_ix_i+w_0)\]
\[p(y|x)=Ber(y|\sigma(\sum_i w_ix_i+w_0)).\]
The MLE for \(w\) can be done by minimizing the negative log-likelihood (a.k.a cross-entropy).</p>
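A minimal sketch of this MLE by gradient descent (the data, learning rate, and iteration count are arbitrary choices for the demo); the gradient of the negative log-likelihood w.r.t \(w\) works out to \(\sum_x (\sigma(w^\top x + w_0) - y)\,x\):

```python
import numpy as np

def fit_logreg(X, y, lr=0.5, n_iter=2000):
    """MLE for logistic regression by gradient descent on the negative
    log-likelihood; the gradient w.r.t w is X^T (sigma(Xw) - y)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column for w_0
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))      # p(y=1|x) = sigma(w.x + w_0)
        w -= lr * Xb.T @ (p - y) / len(y)
    return w

# toy data (made up): the label equals the first feature
X = np.array([[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = fit_logreg(X, y)  # the weight on the first feature dominates
```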
<p>In this example, the number of effective parameters in \(w\) is smaller than in \(\theta\), as logistic regression only models the conditional probability (we can always parameterize only n-1 out of n categories; the probability of the remaining category can be inferred since they sum up to one).</p>
<h3 id="overfitting-and-zero-probabilities">Overfitting and Zero Probabilities</h3>
<p>The most common problem for MLE is overfitting, for naive Bayes it can lead to zero probabilities for unseen data. Smoothing techniques are often used to mitigate this, which can be interpreted as a prior assumption.</p>
<h3 id="conditional-independence">Conditional Independence</h3>
<p>Logistic regression is consistent with the naive Bayes assumption that the inputs \(x_i\) are conditionally independent
given \(Y\). Nevertheless, unlike NB (which models every \(p(x_i|y)\) separately), LR only optimizes the conditional likelihood, so it performs better than NB when the data violates the assumption (e.g. when two features are highly correlated).</p>Introduction Naive Bayes and logistic regression are two basic machine learning models that are compared frequently, especially as the generative/discriminative counterpart of one another. However at first sight it seems these two methods are rather different. In naive Bayes we just count the frequencies of features and labels while in logistic regression we optimize the parameters with regard to some loss function. If we express these two models as probabilistic graphical models, we’ll see exactly how they are related.