# Blog

## Multiclass Classification and Multiple Binary Classifiers

Placeholder

## Atomic Operations and Randomness

Placeholder

## All (I Know) about Markov Chains

### Markov Chains

In this blog, "Markov chains" refers to discrete-time homogeneous Markov chains.

## Summary of Latent Dirichlet Allocation

### What does LDA do?

The LDA model converts a bag-of-words document into a (sparse) vector, where each dimension corresponds to a topic, and topics are learned to capture statistical relations between words. Here’s a nice illustration from Bayesian Methods for Machine Learning by National Research University Higher School of Economics.

## Expectation Maximization Sketch

### Intro

Say we have observed data \(X\), a latent variable \(Z\) and a parameter \(\theta\), and we want to maximize the log-likelihood \(\log p(X|\theta)\). Sometimes this is not an easy task, perhaps because there is no closed-form solution, the gradient is difficult to compute, or there are complicated constraints that \(\theta\) must satisfy.

## Central Limit Theorem and Stuff

This post is about some understandings of the Central Limit Theorem and (un)related stuff.

## Cortana Skills Walkthrough

At the time of writing, Cortana supports two ways of creating skills as shown here.

## Softmax

### Normalization functions

There are many ways of doing normalization; in this article we focus on the following type:
\[y=f(x) \text{ s.t. } y_i\geq 0, \sum y_i=1. \]
Such normalization functions are widely used in machine learning, for example, to represent discrete probability distributions.

## Model the Joint Likelihood?

When building a classifier, we often optimize the conditional log-likelihood \(\log p_\theta(t|x)=\log \mathrm{Multi}(t|y(x), n=1) = \sum_k t_k\log y_k \) w.r.t. some parameter \(\theta\), where \(x\) is the input, \(t\) is the target and \(y\) is the output of the classifier (network), which is guaranteed to be nonnegative with \(\sum_k y_k=1\) via normalization (e.g.
softmax). Mostly the normalization does something like \(y_k = \hat{y}_k/\sum_j \hat{y}_j\). Following the multinomial model, we can interpret each \(y_k\) as \(p_\theta(t_k|x)\). We could further interpret the nonnegative \(\hat{y}_k\) as \(p_\theta(t_k,x)\); it then follows that \(p_\theta(t_k|x) = \hat{y}_k/\sum_j \hat{y}_j = p_\theta(t_k,x)/p_\theta(x) \), which is just Bayes' rule. So with the joint probability defined, it seems we can directly read off \(\sum_j \hat{y}_j\) as the marginal likelihood \(p_\theta(x)\). The problem, however, is that \(p_\theta(x)\) could be just any distribution (say, very high for one input and very low for another, or vice versa): as long as the corresponding \(p_\theta(t_k,x)\) is of the same scale, the loss won’t be so different. Indeed, it can be a meaningful distribution if we introduce some assumptions on \(p_\theta(x)\).

## Naive Bayes and Logistic Regression

### Introduction

Naive Bayes and logistic regression are two basic machine learning models that are frequently compared, especially as the generative/discriminative counterparts of one another. At first sight, however, the two methods seem rather different: in naive Bayes we just count the frequencies of features and labels, while in logistic regression we optimize the parameters with regard to some loss function. If we express these two models as probabilistic graphical models, we’ll see exactly how they are related.

## MLE, MAP and Bayesian Methods

### MLE and MAP

One of the most common situations is that we have a model that can produce the (unnormalized) probability \( p(x|\theta) \) for some observation \( x \). We are often interested in the most probable \( \theta \) given the data, i.e. \( \theta^* = \arg\max_\theta p(\theta|x) \).

## Notes on Coding Neural Networks

### Encapsulated Neural Network Libraries

There are many great open source libraries for neural networks and deep learning.
Some of them try to wrap every function they provide into a uniform interface or protocol (so-called define-and-run, e.g. Caffe and the TensorFlow frontend). Such well-encapsulated libraries might be easy to use but difficult to change. With the rapid development of deep learning, it has become a common need for people in the field to experiment with new ideas beyond those encapsulations, and I often find that the very interface or protocol I need is just the programming language itself.

## More on Joint Bayesian Verification

### Derivations

The following are some detailed derivations of the formulas in the paper Bayesian Face Revisited: A Joint Formulation.

## Conditional Random Fields Summary

### The Big Picture

## Recipe for 99%+ Accuracy Face Recognition

### Intro

The title is exaggerated: by “99%+ accuracy face recognition” I actually mean “99%+ accuracy on the LFW dataset”. This recipe contains every big idea you need to know to reproduce the results, and it depends on public data sets only.

## About Search Algorithms

### Tree Based and Hash Based Search

Perhaps the simplest applications of tree based and hash based search are tree maps and hash maps.

## Miscellaneous

### Probabilistic distributions over the whole real line

Why do many distributions over the real line decrease towards both ends (e.g. Cauchy, Gaussian)? Why can't there be a uniform distribution over the entire space? Because we have to make sure the PDF integrates to one.

## How to Create a Blog Like This

First create a repo on GitHub with the name username.github.io. Copy everything from https://github.com/mmistakes/so-simple-theme.git (or another Jekyll theme repo) to the repo. In the _config.yml file, set url: https://username.github.io. Celebrate.
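
As a rough sketch of the _config.yml step, the fragment below shows only the `url` setting named above. Here `username` is a placeholder, and the `baseurl` line is an assumption (an empty value is the usual choice for a user site served from the domain root), not something the steps above require.

```yaml
# _config.yml (fragment; all other keys from the theme are left as shipped)
url: "https://username.github.io"  # replace "username" with your GitHub user name
baseurl: ""                        # assumption: the site is served from the domain root
```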