### probabilistic distributions over the whole real line

Why many distributions over the real number line decrease towards both ends (e.g. Cauchy, Gaussian)? Why there can not be a uniform distribution over the over entire space? Because we have to make sure the PDF integrates to one.

In other words, the improper integral \[\int^\infty_{-\infty}\tilde{p}(x)dx\] has to converge, which means, the limit of the anti-derivative of the unnormalized distribution \(\tilde{p}\) at (negative) infinity has to converge to a finite number. For instance, \(\tilde{p}(x)=\frac{1}{1+x^2}\) (Cauchy) does and \(\tilde{p}(x)=\frac{1}{1+|x|}\) does not.

### tailedness

The 4th moment (or any order moment) alone doesn’t tell us about the tailedness of a probabilistic distribution, the combination of two different order moments (e.g. kurtosis) does.

### normalized Euclidean-like distance

Euclidean distance is one of the most commonly used metric, its squared form is \(\|x-y\|^2=\|x\|^2+\|y\|^2-2x\cdot y\). It can be generalized to \(\|x\|^2+\|y\|^2-cx\cdot y\), which could be more effective in certain cases (e.g. where the values of \(x\) and \(y\) are non-negative).

If provided that \(x\) and \(y\) are unit (normalized) vectors, this metric is essentially equivalent (only differ in a scaling factor \(c\)) to the original Euclidean distance, since \(\|x\|^2\) and \(\|y\|^2\) are then just equal to 1. It is also equivalent to the negative cosine similarity of the unnormalized vectors (Wikipedia).

### correlated samples

It says in the DQN paper that

To alleviate the problems of correlated data and non-stationary distributions, we use an experience replay mechanism which randomly samples previous transitions, and thereby smooths the training distribution over many past behaviors.

It might be difficult to conceptualize what’s the problem with correlated samples, since in the end when we have done sufficient enough number of iterations we’d have seen all possibilities of data many times anyway. However in fact it just doesn’t work (for SGD). Even in the simplest example, the MNIST, if we arrange the training data in order such that almost in every iteration the training samples are of the same label (say all images in the batch are zeros), the network won’t get trained either.

So we see the power of i.i.d data, to address the correlated sample problem, I guess the techniques of constructing i.i.d data from a MCMC sequence might also apply here.

### power of deep neural networks

Deep neural networks can be trained to do a wide range of amazing jobs, which makes many people (especially from other fields) think of neural networks as some dark magic that seems to have the potential to do everything. However actually the power of deep neural networks is, it is able to be optimized well (through many empirical tricks) towards very abstract supervision signals.