The prior is a subjective distribution over the hypothesis space, encoding what we believe before seeing data.
The posterior is the likelihood times the prior, normalized.
In general, when we have enough data, the posterior becomes peaked on a single concept, namely the MAP estimate. The MAP estimate has several drawbacks: it provides no measure of uncertainty; plugging it in can result in overfitting; the mode is an atypical point of the posterior; and it is not invariant to reparameterization.
Bayesian models with more parameters do not necessarily have higher marginal likelihood; this effect is called the Bayesian Occam's razor.
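The Occam's razor effect can be illustrated with a small sketch comparing two coin models: one with no free parameter (theta fixed at 0.5) and one with theta ~ Beta(1, 1) integrated out. The function names and the example counts are illustrative assumptions, not part of the notes; the evidence is computed for one particular ordered sequence of flips.

```python
from math import exp, lgamma, log


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_evidence_fixed(n1, n0, theta=0.5):
    """Log marginal likelihood of a flip sequence when theta is fixed (no free parameter)."""
    return n1 * log(theta) + n0 * log(1 - theta)


def log_evidence_beta(n1, n0, a=1.0, b=1.0):
    """Log marginal likelihood of a flip sequence with theta ~ Beta(a, b) integrated out."""
    return log_beta(n1 + a, n0 + b) - log_beta(a, b)


# Hypothetical data sets: balanced flips, then heavily skewed flips.
for n1, n0 in [(5, 5), (9, 1)]:
    fixed = exp(log_evidence_fixed(n1, n0))
    flexible = exp(log_evidence_beta(n1, n0))
    print(f"N1={n1} N0={n0} fixed={fixed:.6f} flexible={flexible:.6f}")
```

With balanced data the simpler model has the higher evidence, because the flexible model spreads its probability mass over many possible data sets; with skewed data the flexible model wins.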
Beta - Binomial model
Given N1 ~ Bin(N, theta), with N = N1+N0 trials (N1 successes, N0 failures),
Prior theta ~ Beta(a, b),
Posterior theta | D ~ Beta(N1+a, N0+b), where D denotes the observed data.
The MLE theta=N1/(N1+N0)
The posterior mean theta = (N1+a)/(N1+N0+a+b)
The posterior mode (MAP) theta = (N1+a-1)/(N1+N0+a+b-2)
When N1+N0 is much larger than a+b, the posterior variance is approximately theta(1-theta)/(N1+N0), where theta is the MLE.
In particular, if a=b=1 (a uniform prior), the posterior mean is theta = (N1+1)/(N1+N0+2), which is add-one smoothing in practice.
Updating the posterior sequentially, one observation or mini-batch at a time, gives the same result as updating in a single batch. This makes Bayesian inference particularly well-suited to online learning.
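A minimal sketch of the Beta-Binomial update, checking that sequential and batch updates agree. The helper names and the example flips are assumptions made for illustration.

```python
def beta_update(a, b, n1, n0):
    """Conjugate update: Beta(a, b) prior plus n1 successes and n0 failures."""
    return a + n1, b + n0


def beta_summaries(a, b):
    """Mean, mode and variance of Beta(a, b); the mode is only defined here for a, b > 1."""
    mean = a / (a + b)
    mode = (a - 1) / (a + b - 2) if a > 1 and b > 1 else None
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, mode, var


prior = (1.0, 1.0)                        # uniform prior, a = b = 1
flips = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1]    # hypothetical data: 7 successes, 3 failures

# Batch update: use the sufficient statistics N1 and N0 all at once.
batch = beta_update(*prior, n1=sum(flips), n0=len(flips) - sum(flips))

# Sequential update: yesterday's posterior becomes today's prior.
seq = prior
for x in flips:
    seq = beta_update(*seq, n1=x, n0=1 - x)

mean, mode, var = beta_summaries(*batch)
print(batch, seq, mean, mode, var)
```

With the uniform prior the posterior mode coincides with the MLE N1/(N1+N0), while the posterior mean is the add-one smoothed estimate.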
Dirichlet - Multinomial model
Given X ~ Multinomial(theta1, ..., thetak),
Prior theta1, ..., thetak ~ Dirichlet(alpha1, ..., alphak),
Posterior theta1, ..., thetak | D ~ Dirichlet(N1+alpha1, ..., Nk+alphak), where the data D contain N = N1+...+Nk observations and Nj is the count of category j.
The MLE thetaj = Nj/(N1+...+Nk)
The mode (MAP) thetaj = (Nj+alphaj-1)/(N1+...+Nk+alpha1+...+alphak-k)
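The Dirichlet-Multinomial update has the same shape as the Beta-Binomial one, as the following sketch suggests. The function names, counts, and pseudo-counts are illustrative assumptions.

```python
def dirichlet_posterior(alpha, counts):
    """Conjugate update: add the observed counts to the prior pseudo-counts."""
    return [a + n for a, n in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of Dirichlet(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def dirichlet_mode(alpha):
    """Mode of Dirichlet(alpha); this formula assumes every alpha_j > 1."""
    k = len(alpha)
    total = sum(alpha)
    return [(a - 1) / (total - k) for a in alpha]


counts = [3, 0, 7]           # hypothetical counts N1, N2, N3
alpha = [2.0, 2.0, 2.0]      # hypothetical symmetric prior

post = dirichlet_posterior(alpha, counts)
mle = [n / sum(counts) for n in counts]
print("posterior:", post)
print("MLE:", mle)
print("MAP:", dirichlet_mode(post))
print("mean:", dirichlet_mean(post))
```

The second category was never observed, so its MLE is zero, whereas the MAP estimate and posterior mean stay strictly positive thanks to the prior pseudo-counts.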
Gaussian - Gaussian-Wishart model
Let Z0 be the normalizer of the prior.
Let the precision matrix Sigma^{-1} have a Wishart prior (equivalently, Sigma has an inverse-Wishart prior).
Let ZN be the normalizer of the posterior.
Naive Bayes Classifiers
Assuming the features are conditionally independent given the class label, the class-conditional density factorizes into a product of one-dimensional densities; the resulting model is the naive Bayes classifier.
In the case of real-valued features, assume a Gaussian distribution.
In the case of binary features, assume a Bernoulli distribution.
In the case of categorical features, assume a multinoulli distribution.
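For the binary-feature case, a naive Bayes classifier can be sketched in a few lines. This is an illustrative sketch, not a reference implementation: the function names and the toy data are assumptions, the feature parameters use the posterior mean under a Beta(a, b) prior, and the class prior is simply the empirical class frequency.

```python
from math import log


def fit_bernoulli_nb(X, y, a=1.0, b=1.0):
    """Estimate a log class prior and per-feature Bernoulli parameters for each class."""
    model = {}
    n_features = len(X[0])
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n_c = len(rows)
        # Posterior mean of each feature's Bernoulli parameter (add-one smoothing if a = b = 1).
        theta = [(sum(r[j] for r in rows) + a) / (n_c + a + b) for j in range(n_features)]
        model[c] = (log(n_c / len(y)), theta)
    return model


def predict(model, x):
    """Return the class with the highest log joint probability log p(c) + sum_j log p(x_j | c)."""
    def log_joint(c):
        log_prior, theta = model[c]
        return log_prior + sum(log(t) if xj else log(1 - t) for xj, t in zip(x, theta))
    return max(model, key=log_joint)


# Hypothetical toy data: three binary features per example.
X = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1], [0, 0, 0]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = fit_bernoulli_nb(X, y)
print(predict(model, [1, 1, 0]))
print(predict(model, [0, 0, 1]))
```

Working in log space avoids numerical underflow when many features are multiplied together, and the smoothing keeps every parameter away from 0 and 1 so the logarithms stay finite.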
Priors
Uniform priors
Robust priors
Mixtures of conjugate priors
Hierarchical Bayes
Empirical Bayes
Bayes estimators for common loss functions
MAP estimate minimizes 0-1 loss
Posterior median minimizes L1 (absolute) loss
Posterior mean minimizes L2 (squared) loss
Reject option for classification
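The estimators listed above can be checked numerically on a small set of posterior samples, and the reject option can be written as a threshold rule. The sample values, the grid search, and the cost names are assumptions chosen for illustration; the reject rule assumes a cost cost_reject for rejecting and cost_error for a misclassification.

```python
from statistics import mean, median

# Hypothetical posterior samples of a scalar parameter (skewed on purpose).
samples = [0.1, 0.2, 0.2, 0.3, 0.9]


def expected_loss(estimate, loss):
    """Average loss of an estimate over the posterior samples."""
    return sum(loss(s, estimate) for s in samples) / len(samples)


def l1(s, e):
    return abs(s - e)


def l2(s, e):
    return (s - e) ** 2


# Brute-force search over a grid of candidate estimates in [0, 1].
grid = [i / 100 for i in range(101)]
best_l1 = min(grid, key=lambda e: expected_loss(e, l1))
best_l2 = min(grid, key=lambda e: expected_loss(e, l2))
print("L1 minimizer:", best_l1, "median:", median(samples))
print("L2 minimizer:", best_l2, "mean:", mean(samples))


def decide_with_reject(posterior, cost_reject, cost_error):
    """Pick the most probable class, or return None (reject) if it is not probable enough."""
    label = max(posterior, key=posterior.get)
    if posterior[label] < 1 - cost_reject / cost_error:
        return None
    return label


print(decide_with_reject({"a": 0.55, "b": 0.45}, cost_reject=0.3, cost_error=1.0))
print(decide_with_reject({"a": 0.90, "b": 0.10}, cost_reject=0.3, cost_error=1.0))
```

Because the samples are skewed, the median and mean differ, which makes it visible that each loss picks out a different summary of the posterior.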