# Latest Posts

## Memory RNN

Placeholder.

## Multiclass Classification and Multiple Binary Classifiers

Placeholder.

## Atomic Operations and Randomness

Placeholder.

## All (I Know) about Markov Chains

By Markov chains we refer to discrete-time homogeneous Markov chains in this blog.

## Summary of Latent Dirichlet Allocation

What does LDA do? The LDA model converts a bag-of-words document into a (sparse) vector, where each dimension corresponds to a topic, and topics are learned to capture statistical relations of words. Here's a nice illustration from Bayesian Methods for Machine Learning by National Research University Higher School of Economics.

## Expectation Maximization Sketch

Intro: Say we have observed data \(X\), a latent variable \(Z\), and a parameter \(\theta\), and we want to maximize the log-likelihood \(\log p(X|\theta)\). Sometimes this is not an easy task, perhaps because there is no closed-form solution, the gradient is difficult to compute, or there are complicated constraints that \(\theta\) must satisfy.

## Central Limit Theorem and Stuff

This post is about some understandings of the Central Limit Theorem and (un)related stuff.

## Cortana Skills Walkthrough

At the time of writing, Cortana supports two ways of creating skills, as shown here.

## Softmax

Normalization functions: There are many ways of doing normalization; in this article we focus on the following type
\[y=f(x) \text{ s.t. } y_i\geq 0, \; \sum y_i=1.\]
Such normalization functions are widely used in machine learning, for example to represent discrete probability distributions.

## Model the Joint Likelihood?

When building a classifier, we often optimize the conditional log-likelihood \(\log p_\theta(t|x)=\log \mathrm{Multi}(t|y(x), n=1) = \sum t_k\log y_k\) w.r.t. some parameter \(\theta\), where \(x\) is the input, \(t\) is the target, and \(y\) is the output of the classifier (network), which is guaranteed to be nonnegative with \(\sum y_k=1\) via normalization (e.g. softmax). The normalization typically has the form \(y_k = \hat{y}_k/\sum \hat{y}_j\). Following the multinomial model, we can interpret each \(y_k\) as \(p_\theta(t_k|x)\); we could further interpret the (nonnegative) \(\hat{y}_k\) as \(p_\theta(t_k,x)\), and then \(p_\theta(t_k|x) = \hat{y}_k/\sum \hat{y}_j = p_\theta(t_k,x)/p_\theta(x)\), which is just Bayes' rule. With the joint probability defined, it seems we can directly read off \(\sum \hat{y}_j\) as the marginal likelihood \(p_\theta(x)\). The problem, however, is that \(p_\theta(x)\) could be almost any distribution (say, very high for one input and very low for another): as long as the corresponding \(p_\theta(t_k,x)\) stays on the same scale, the loss barely changes. It can still become a meaningful distribution if we introduce some assumptions about \(p_\theta(x)\).
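The scale-invariance point above can be illustrated with a minimal sketch (the function name `softmax_with_marginal` is my own, not from any post): treating the unnormalized scores \(\hat{y}_k\) as joint scores, their sum plays the role of \(p_\theta(x)\), yet rescaling all scores for an input changes that "marginal" without changing the conditional that the loss actually sees.

```python
import math

def softmax_with_marginal(logits):
    # Unnormalized scores y_hat_k = exp(logit_k); reading them as joint
    # scores p(t_k, x), their sum plays the role of the marginal p(x).
    y_hat = [math.exp(z) for z in logits]
    marginal = sum(y_hat)
    # Conditional p(t_k | x) via Bayes' rule: y_hat_k / sum_j y_hat_j.
    y = [v / marginal for v in y_hat]
    return y, marginal

# Shifting all logits by log(10) scales every y_hat_k by 10: the
# "marginal" grows tenfold, but the conditional distribution (and hence
# the log-likelihood loss) is unchanged up to float error.
y1, m1 = softmax_with_marginal([1.0, 2.0, 3.0])
y2, m2 = softmax_with_marginal([z + math.log(10) for z in [1.0, 2.0, 3.0]])
```

This is exactly why the marginal cannot be read off for free: the training objective only constrains the ratios \(\hat{y}_k/\sum \hat{y}_j\), leaving the overall scale per input unidentified unless extra assumptions pin it down.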