Wednesday, June 12, 2013

Bayesian Model

The prior is a subjective distribution over the hypothesis space.

The posterior is the likelihood times the prior, normalized: p(theta | D) = p(D | theta) p(theta) / p(D).

In general, when we have enough data, the posterior becomes peaked around a single concept, namely the MAP estimate. But MAP estimation has drawbacks: it gives no measure of uncertainty; plugging in the MAP estimate can result in overfitting; the mode can be an untypical point of the distribution; and MAP is not invariant to reparameterization.
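The reparameterization point can be checked numerically; a sketch with scipy, using an arbitrary Beta(3, 2) posterior and the change of variables phi = theta^2:

import numpy as np
from scipy import stats, optimize

# Posterior over theta: Beta(3, 2); its mode is (a-1)/(a+b-2) = 2/3.
a, b = 3.0, 2.0
theta_map = (a - 1) / (a + b - 2)

# Reparameterize: phi = theta**2, so theta = sqrt(phi) and
# p(phi) = Beta(sqrt(phi); a, b) * |dtheta/dphi| = Beta(sqrt(phi); a, b) / (2 sqrt(phi)).
def neg_phi_density(phi):
    return -stats.beta.pdf(np.sqrt(phi), a, b) / (2 * np.sqrt(phi))

res = optimize.minimize_scalar(neg_phi_density, bounds=(1e-6, 1 - 1e-6), method="bounded")

print(theta_map ** 2)  # 0.444...: the theta-mode pushed through the transform
print(res.x)           # 0.25: the actual mode of p(phi); the two disagree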

Bayesian models with more parameters do not necessarily have higher marginal likelihood; this effect is called the Bayesian Occam's razor.

Beta - Binomial model

Given X ~ Bin(theta), with N1 observed successes and N0 failures,
Prior: theta ~ Beta(a, b),
Posterior: theta | D ~ Beta(N1 + a, N0 + b).

The MLE: theta = N1 / (N1 + N0).
The posterior mean: theta = (N1 + a) / (N1 + N0 + a + b).
The posterior mode (MAP): theta = (N1 + a - 1) / (N1 + N0 + a + b - 2).
For large N, the posterior variance is approximately theta(1 - theta) / (N1 + N0), with theta the MLE.

Specifically, if a = b = 1 (a uniform prior), the posterior mean is theta = (N1 + 1) / (N1 + N0 + 2), which is add-one (Laplace) smoothing in practice.
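A minimal sketch of these point estimates in Python (the counts N1, N0 and hyperparameters a, b below are hypothetical):

# Beta-Binomial point estimates from counts of successes (N1) and failures (N0).
N1, N0 = 17, 3          # hypothetical data
a, b = 2.0, 2.0         # Beta(a, b) prior hyperparameters
N = N1 + N0

mle = N1 / N                                  # theta_MLE = N1 / (N1 + N0)
post_mean = (N1 + a) / (N + a + b)            # posterior mean
post_mode = (N1 + a - 1) / (N + a + b - 2)    # MAP estimate
approx_var = mle * (1 - mle) / N              # large-N plug-in variance

print(mle, post_mean, post_mode, approx_var)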

The posterior can be updated sequentially, one observation at a time, and the result is identical to updating in a single batch. This makes Bayesian inference particularly well-suited to online learning.
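For the Beta-Binomial model this equivalence is easy to check; a sketch with a hypothetical data stream:

# Batch update: fold all counts into the Beta prior at once.
data = [1, 0, 1, 1, 0, 1]    # hypothetical Bernoulli observations
a, b = 1.0, 1.0              # start from a uniform Beta(1, 1) prior

a_batch = a + sum(data)
b_batch = b + len(data) - sum(data)

# Sequential update: yesterday's posterior is today's prior.
a_seq, b_seq = a, b
for x in data:
    a_seq += x
    b_seq += 1 - x

assert (a_seq, b_seq) == (a_batch, b_batch)   # identical posteriors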

Dirichlet - Multinomial model

Given X ~ Multinomial(theta1, ..., thetaK), with category counts N1, ..., NK,
Prior: (theta1, ..., thetaK) ~ Dirichlet(alpha1, ..., alphaK),
Posterior: (theta1, ..., thetaK) | D ~ Dirichlet(alpha1 + N1, ..., alphaK + NK).

The MLE: thetak = Nk / N, where N = N1 + ... + NK.
The posterior mode (MAP): thetak = (Nk + alphak - 1) / (N + alpha0 - K), where alpha0 = alpha1 + ... + alphaK.
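The same updates in code (a sketch with numpy; the counts and alpha values are hypothetical):

import numpy as np

counts = np.array([30, 10, 5])       # N1, ..., NK: hypothetical category counts
alpha  = np.array([1.0, 1.0, 1.0])   # Dirichlet(alpha1, ..., alphaK) prior
K, N = len(counts), counts.sum()

post_alpha = alpha + counts                            # posterior is Dirichlet(alpha + counts)
mle  = counts / N                                      # thetak = Nk / N
mode = (counts + alpha - 1) / (N + alpha.sum() - K)    # MAP (requires all alphak >= 1)
mean = post_alpha / post_alpha.sum()                   # posterior mean

print(mle, mode, mean)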

Gaussian - Gaussian-Wishart model
For data with unknown mean and covariance, the conjugate prior on (mu, Sigma) is the Normal-inverse-Wishart (equivalently, a Gaussian-Wishart prior on the mean and precision).
Let Z0 be the normalizer of the prior and ZN the normalizer of the posterior; the marginal likelihood is then the ratio ZN / Z0, up to the (2 pi)^(ND/2) normalizer of the Gaussian likelihood.
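A sketch of the standard conjugate update for the Normal-inverse-Wishart prior (the hyperparameter names m0, kappa0, nu0, S0 and the data below are hypothetical; this follows the usual textbook formulas rather than anything specific to these notes):

import numpy as np

def niw_posterior(X, m0, kappa0, nu0, S0):
    # Standard conjugate update for a Normal-inverse-Wishart prior on (mu, Sigma).
    N, D = X.shape
    xbar = X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)            # scatter matrix around the sample mean
    kappaN = kappa0 + N
    nuN = nu0 + N
    mN = (kappa0 * m0 + N * xbar) / kappaN
    SN = S0 + S + (kappa0 * N / kappaN) * np.outer(xbar - m0, xbar - m0)
    return mN, kappaN, nuN, SN

# Hypothetical 2-D data and a weak prior.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
mN, kappaN, nuN, SN = niw_posterior(X, m0=np.zeros(2), kappa0=1.0, nu0=4.0, S0=np.eye(2))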

Naive Bayes Classifiers

Assuming the features are conditionally independent given the class label, the class-conditional density factorizes into a product of per-feature densities; this is the naive Bayes classifier.

In the case of real-valued features, assume a Gaussian distribution (a sketch follows this list).
In the case of binary features, assume Bernoulli distribution.
In the case of categorical features, assume Multinoulli distribution.
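A minimal sketch of the Gaussian case with numpy (the class name and the small variance floor are my own illustrative choices):

import numpy as np

class GaussianNB:
    def fit(self, X, y):
        # Per-class, per-feature means and variances, plus log class priors.
        self.classes = np.unique(y)
        self.mu  = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        self.log_prior = np.log(np.array([np.mean(y == c) for c in self.classes]))
        return self

    def predict(self, X):
        # log p(y=c | x) is proportional to log p(y=c) + sum_j log N(x_j | mu_cj, var_cj)
        ll = -0.5 * (np.log(2 * np.pi * self.var[:, None, :])
                     + (X[None, :, :] - self.mu[:, None, :]) ** 2
                       / self.var[:, None, :]).sum(-1)
        return self.classes[np.argmax(self.log_prior[:, None] + ll, axis=0)]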

Priors

Uniform priors
Robust priors
Mixtures of conjugate priors
Hierarchical Bayes
Empirical Bayes

Bayes Estimators for common loss functions

MAP estimate minimizes 0-1 loss

Posterior median minimizes the L1 (absolute error) loss

Posterior mean minimizes the L2 (squared error) loss
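These facts can be checked by brute force over posterior samples; a sketch with numpy, using an arbitrary Gamma(2, 1) as a stand-in posterior:

import numpy as np

rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=1.0, size=2000)   # stand-in posterior draws

grid = np.linspace(0.0, 10.0, 1001)                             # candidate point estimates
l1 = np.abs(grid[:, None] - samples[None, :]).mean(axis=1)      # expected |theta - a|
l2 = ((grid[:, None] - samples[None, :]) ** 2).mean(axis=1)     # expected (theta - a)^2

print(grid[np.argmin(l1)], np.median(samples))   # L1 minimizer is near the posterior median
print(grid[np.argmin(l2)], samples.mean())       # L2 minimizer is near the posterior mean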

Reject option for classification: the classifier may abstain when the maximum posterior class probability falls below a threshold, trading coverage for accuracy.

