Saturday, July 7, 2012

Discrete Distribution and Continuous Distribution

Moments of Truth

https://www.paypal-engineering.com/2016/04/11/statistics-for-software/?utm_content=bufferd2f5d&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer



Discrete Distribution

Bernoulli distribution / Binomial distribution

Multinoulli distribution (categorical distribution) / Multinomial distribution

Poisson distribution

Uniform distribution

Continuous Distribution

Normal distribution
1 it has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.
2 the central limit theorem tells us that sums of independent random variables have an approximately normal distribution, making it a good choice for modeling residual errors or noise (see the sketch below).
3 the normal distribution makes the fewest assumptions (it has maximum entropy), subject to the constraint of having a specified mean and variance. This makes it a good default choice in many cases.
4 it has a simple mathematical form, which results in methods that are easy to implement but often highly effective.
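A minimal numpy sketch of point 2 (my own illustration, not from the original notes): summing 30 independent Uniform(0,1) draws already gives something close to a normal distribution.

import numpy as np

rng = np.random.default_rng(0)
# Sum 30 independent Uniform(0,1) draws, 100,000 times.
sums = rng.uniform(0, 1, size=(100_000, 30)).sum(axis=1)

# The CLT predicts mean n/2 = 15 and variance n/12 = 2.5.
print(sums.mean(), sums.var())                         # ~15.0, ~2.5
# Fraction of samples within one standard deviation, ~0.68 for a normal.
sd = sums.std()
print(np.mean(np.abs(sums - sums.mean()) < sd))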

Student t distribution
The Student t distribution is more robust than the Gaussian because it has heavier tails, at least for small v.
If v=1, this distribution is known as the Cauchy or Lorentz distribution. It is notable for having such heavy tails that the integral that defines the mean does not converge.
To ensure finite variance, we require v > 2.
It is common to use v = 4, which gives good performance in a range of problems.
The smaller v is, the fatter the tails become. For v >> 5, the Student distribution rapidly approaches a normal distribution and loses its robustness properties.
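A rough scipy sketch of the heavy-tail claim (the parameter values are my own choices):

from scipy import stats

# P(|X| > 4) under a standard normal vs. a Student t with v = 4.
print(2 * stats.norm.sf(4))        # ~6e-05
print(2 * stats.t.sf(4, df=4))     # ~0.016 -- far heavier tails
# As v grows, the t tail probability approaches the normal one.
print(2 * stats.t.sf(4, df=100))   # close to the normal value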

Laplace distribution
It is robust to outliers, and also puts more probability density at 0 than the normal does. It is a useful way to encourage sparsity in a model.
Note that the Student distribution is not log-concave for any parameter value, unlike the Laplace distribution, which is always log-concave (and log-convex). Nevertheless, both are unimodal.

The Laplace distribution is similar to the Gaussian distribution, but its first derivative is undefined at zero because it has a very sharp peak there. The Laplace distribution concentrates its probability mass much closer to zero than the normal distribution does; the resulting effect when we use it as a prior is a tendency to drive some parameters to zero. Thus the lasso (or its Bayesian analog) can be used to regularize and also to perform variable selection (removing some terms or variables from a model).
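A small scipy comparison of the two densities (my own sketch; both are standardized to unit variance so that only the shape differs):

from scipy import stats

# Laplace with scale 1/sqrt(2) has variance 1, like the standard normal.
lap = stats.laplace(scale=2 ** -0.5)
norm = stats.norm()

print(lap.pdf(0), norm.pdf(0))     # more density at 0: ~0.71 vs ~0.40
print(lap.sf(4), norm.sf(4))       # and heavier tails: ~1.7e-3 vs ~3.2e-5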

Gamma distribution
1 a is the shape parameter, b is the rate parameter.
2 if a <= 1, the mode is at 0; otherwise the mode is > 0. As the rate b increases, the horizontal scale shrinks, squeezing everything leftwards and upwards.
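A quick numerical check with scipy (my own sketch; note scipy parameterizes the gamma by scale = 1/b, and for a > 1 the mode is (a - 1)/b):

from scipy import stats
import numpy as np

a, b = 3.0, 2.0                              # shape a, rate b
dist = stats.gamma(a, scale=1 / b)

# For a > 1 the peak of the density sits at (a - 1) / b.
xs = np.linspace(0.01, 5, 2000)
print(xs[np.argmax(dist.pdf(xs))])           # ~1.0 = (3 - 1) / 2

# Doubling the rate b squeezes the distribution toward 0.
print(stats.gamma(a, scale=1 / (2 * b)).mean(), dist.mean())   # 0.75 vs 1.5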

Beta distribution
1 if a=b=1, the beta distribution becomes the uniform distribution.
2 if a and b are both less than 1, we get a bimodal distribution with spikes at 0 and 1.
3 if a and b are both greater than 1, the distribution is unimodal.
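The three cases, evaluated with scipy (my own sketch):

from scipy import stats
import numpy as np

xs = np.array([0.01, 0.5, 0.99])
print(stats.beta(1, 1).pdf(xs))       # a=b=1: flat, all 1.0 (uniform)
print(stats.beta(0.5, 0.5).pdf(xs))   # a,b<1: large near 0 and 1, dip in the middle
print(stats.beta(2, 2).pdf(xs))       # a,b>1: single hump around 0.5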

The Pareto distribution is used to model the distribution of quantities that exhibit long tails, i.e., heavy tails.
1 if we plot the distribution on a log-log scale, it forms a straight line of the form log p(x) = a log x + c. This is known as a power law.
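A quick scipy check of the power-law property (my own sketch; for scipy's Pareto with shape a, the log density is linear in log x with slope -(a + 1)):

from scipy import stats
import numpy as np

a = 2.5
xs = np.array([1.0, 10.0, 100.0, 1000.0])
logp = stats.pareto.logpdf(xs, a)

# The slope of log p(x) against log x is constant: -(a + 1).
print(np.diff(logp) / np.diff(np.log(xs)))   # ~[-3.5, -3.5, -3.5]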

Joint Probability Distribution

Uncorrelated does not imply independent.
Note that the correlation reflects the noisiness and direction of a linear relationship, but not the slope of that relationship, nor many aspects of nonlinear relationships.
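A small numpy illustration of the slope point (my own sketch; the variable names are mine):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
noise = rng.normal(size=10_000)

steep = 10.0 * x + noise              # large slope
shallow = 0.1 * x + 0.01 * noise      # small slope, same noise-to-signal ratio

# The correlations are essentially identical: correlation measures
# noisiness and direction, not slope.
print(np.corrcoef(x, steep)[0, 1], np.corrcoef(x, shallow)[0, 1])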

Multivariate Gaussian distribution

Multivariate Student's t distribution

Dirichlet distribution
natural generalization of the beta distribution
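A minimal numpy sketch of this (my own illustration): Dirichlet samples are probability vectors, and with two components the Dirichlet reduces to a beta.

import numpy as np

rng = np.random.default_rng(0)

# Each Dirichlet draw is a probability vector (non-negative, sums to 1).
samples = rng.dirichlet([2.0, 3.0, 5.0], size=5)
print(samples.sum(axis=1))            # all 1.0

# With two components, Dirichlet(a, b) reduces to Beta(a, b).
two = rng.dirichlet([2.0, 5.0], size=100_000)[:, 0]
print(two.mean())                     # ~2 / (2 + 5) = 0.286, the Beta(2, 5) mean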

Transformation of random variables

Linear transformation

General transformation - Jacobian matrix J

Central limit theorem

Monte Carlo approximation

Generate S samples from the distribution, e.g., using MCMC (Markov chain Monte Carlo), then use Monte Carlo to approximate the expected value of any function of a random variable. That is, simply draw samples, and then compute the mean of the function applied to the samples.
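A minimal numpy sketch of the idea (my own example values):

import numpy as np

rng = np.random.default_rng(0)

# Approximate E[f(X)] for X ~ N(0, 1) with f(x) = x**2 (true value is 1).
samples = rng.normal(size=100_000)
print(np.mean(samples ** 2))          # ~1.0

# Same idea for a probability: P(X > 1.96) is the mean of the indicator (~0.025).
print(np.mean(samples > 1.96))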

Information Theory

Entropy

The entropy of a random variable is a measure of its uncertainty.

H(X) = - sum_1^K p(X=k) log_2 p(X=k)

The cross entropy is the average number of bits (bit is short for binary digit; log base 2) needed to encode data coming from a source with distribution p when we use model q to encode it.
The uniform distribution has maximum entropy.
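A small sketch in numpy (my own helper, computing entropy in bits; the uniform case gives the maximum for a fixed number of outcomes):

import numpy as np

def entropy_bits(p):
    # H(p) = -sum_k p_k log2 p_k, ignoring zero-probability outcomes.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))   # 2.0 bits, the maximum for K = 4
print(entropy_bits([0.7, 0.1, 0.1, 0.1]))       # ~1.36 bits
print(entropy_bits([1.0, 0.0, 0.0, 0.0]))       # 0.0 bits, no uncertainty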

KL divergence

One way to measure the dissimilarity of two probability distributions, p and q, is known as the Kullback-Leibler divergence (KL divergence) or relative entropy.

KL(p || q) = sum_x p(X=x) log_2 (p(X=x)/q(X=x))
= sum_1^K p(X=k) log_2 p(X=k) - sum_1^K p(X=k) log_2 q(X=k)
= H(p, q) - H(p)

H(p, q) is called the cross entropy, which is the average number of bits needed to encode data coming from a source with distribution p when we use model q to define the code. Hence the regular entropy is the expected number of bits if we use the true model, and the KL divergence is the difference between the two. In other words, the KL divergence is the average number of extra bits needed to encode the data, due to the fact that we use distribution q to encode the data instead of the true distribution p.
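A numerical check of the identity KL(p || q) = H(p, q) - H(p), with distributions of my own choosing:

import numpy as np

p = np.array([0.5, 0.25, 0.25])
q = np.array([1 / 3, 1 / 3, 1 / 3])

entropy = -np.sum(p * np.log2(p))             # H(p)
cross_entropy = -np.sum(p * np.log2(q))       # H(p, q): bits when coding with q
kl = np.sum(p * np.log2(p / q))               # KL(p || q)

print(kl, cross_entropy - entropy)            # identical: the extra bits paid for using q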

Mutual Information

The mutual information of X and Y is the reduction in uncertainty about X after observing Y, or, by symmetry, the reduction in uncertainty about Y after observing X.

I(X, Y) = KL(P(X,Y) || P(X)P(Y)) = sum_x sum_y p(x,y) log( p(x,y) / (p(x)p(y)) )
= H(X) + H(Y) - H(X, Y)
= H(X, Y) - H(X|Y) - H(Y|X)

H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X, Y) is the joint entropy of X and Y.
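A numpy sketch computing the mutual information of a small joint table two ways (the 2x2 joint distribution is my own example):

import numpy as np

# Joint distribution p(x, y) over two binary variables.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)
py = pxy.sum(axis=0)

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

mi = H(px) + H(py) - H(pxy.ravel())            # I(X; Y) = H(X) + H(Y) - H(X, Y)
# Equivalently, the KL divergence between the joint and the product of marginals:
kl = np.sum(pxy * np.log2(pxy / np.outer(px, py)))
print(mi, kl)                                  # both ~0.278 bits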
