Saturday, July 7, 2012

Simpson's Paradox

Simpson's Paradox occurs when the estimate of an aggregate effect is misleading and markedly different from the effect seen in the underlying categories. A famous example occurred in graduate admissions at the University of California, Berkeley, where an apparent bias in admissions was due instead to the fact that different departments had different overall admission rates and numbers of applicants.
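A toy calculation makes the reversal concrete (a minimal sketch in R; the numbers are made up for illustration, not the actual Berkeley data):

## Within each department the admission rate for women is at least as high as
## for men, yet the aggregate rate appears to favor men.
adm <- data.frame(
  dept     = c("A", "A", "B", "B"),
  gender   = c("M", "F", "M", "F"),
  applied  = c(800, 100, 100, 800),
  admitted = c(500,  70,  20, 200)
)
transform(adm, rate = admitted / applied)                       # per-department rates
agg <- aggregate(cbind(admitted, applied) ~ gender, data = adm, FUN = sum)
transform(agg, rate = admitted / applied)                       # aggregate rates reverse the pattern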

Discrete Distribution and Continuous Distribution

Moments of Truth

https://www.paypal-engineering.com/2016/04/11/statistics-for-software/?utm_content=bufferd2f5d&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer



Discrete Distribution

Bernoulli distribution / Binomial distribution

Multinoulli distribution (categorical distribution) / Multinomial distribution

Poisson distribution

Uniform distribution

Continuous Distribution

Normal distribution
1 It has two parameters which are easy to interpret, and which capture some of the most basic properties of a distribution, namely its mean and variance.
2 The central limit theorem tells us that sums of independent random variables have an approximately normal distribution, making it a good choice for modeling residual errors or noise (see the sketch after this list).
3 The normal distribution makes the fewest assumptions (it has maximum entropy), subject to the constraint of having a specified mean and variance. This makes it a good default choice in many cases.
4 It has a simple mathematical form, which results in easy-to-implement but often highly effective methods.
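A quick illustration of point 2 (a minimal sketch; the choice of 30 uniform draws per sum is arbitrary):

## Sums of independent Uniform(0,1) draws are approximately normal (CLT)
set.seed(1)
sums <- replicate(10000, sum(runif(30)))              # 10,000 sums of 30 uniforms
hist(sums, breaks = 50, freq = FALSE)
curve(dnorm(x, mean = 30 * 0.5, sd = sqrt(30 / 12)),  # CLT mean and sd for Uniform(0,1)
      add = TRUE, lwd = 2)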

Student t distribution
The Student t distribution is more robust than the normal because it has heavier tails, at least for small v.
If v=1, this distribution is known as the Cauchy or Lorentz distribution. It is notable for having such heavy tails that the integral that defines the mean does not converge.
To ensure finite variance, we require v>2.
It is common to use v=4, which gives good performance in a range of problems.
The smaller v is, the fatter the tails become. For v>>5, the Student distribution rapidly approaches a normal and loses its robustness properties.
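A quick numeric check of the heavy tails (a minimal sketch; the cutoff of 4 standard units is arbitrary):

## Two-sided tail probability beyond |4|: much larger for Student t with small v
v <- c(1, 2, 4, 10, 30)
tail_t <- 2 * pt(-4, df = v)   # Student t tail mass for each v
tail_n <- 2 * pnorm(-4)        # standard normal tail mass
rbind(v = v, t = tail_t, normal = tail_n)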

Laplace distribution
It is robust to outliers, and it also puts more probability density at 0 than the normal. It is a useful way to encourage sparsity in a model.
Note that the Student distribution is not log-concave for any parameter value, unlike the Laplace distribution, which is always log-concave (and log-convex...). Nevertheless, both are unimodal.

The Laplace distribution is similar to the Gaussian distribution, but its first derivative is undefined at zero because it has a very sharp peak there. The Laplace distribution concentrates its probability mass much closer to zero than the normal distribution; the resulting effect when we use it as a prior is that it tends to drive some parameters to zero. Thus it is said that the lasso (or its Bayesian analog) can be used to regularize and also to perform variable selection (removing some terms or variables from a model).
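A minimal sketch of the sharp peak and heavier tails, comparing the standard Laplace density with a normal of the same variance (2); dlaplace below is a hand-rolled helper, not a base R function:

## Laplace(0, b = 1) vs a normal with the same variance: sharper peak, fatter tails
dlaplace <- function(x, b = 1) exp(-abs(x) / b) / (2 * b)
curve(dlaplace(x), from = -5, to = 5, lwd = 2)
curve(dnorm(x, sd = sqrt(2)), add = TRUE, lty = 2)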

Gamma distribution
1 a is the shape parameter and b is the rate parameter.
2 If a<=1, the mode is at 0; otherwise the mode is > 0. As the rate b increases, the horizontal scale shrinks, squeezing everything leftwards and upwards.
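Both effects are easy to see by plotting dgamma (a minimal sketch; the parameter values are arbitrary):

## Gamma shapes: mode at 0 for a <= 1, mode > 0 for a > 1; larger b squeezes leftwards
curve(dgamma(x, shape = 0.9, rate = 1), 0, 8, lwd = 2)
curve(dgamma(x, shape = 2,   rate = 1), 0, 8, add = TRUE, lty = 2)
curve(dgamma(x, shape = 2,   rate = 3), 0, 8, add = TRUE, lty = 3)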

Beta distribution
1 If a=b=1, the beta becomes the uniform distribution.
2 If a and b are both less than 1, we get a bimodal distribution with spikes at 0 and 1.
3 If a and b are both greater than 1, the distribution is unimodal.
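The three cases plotted with dbeta (a minimal sketch; the parameter values are arbitrary):

## Beta shapes: uniform, spikes at 0 and 1, and unimodal
curve(dbeta(x, 1, 1),     0, 1, ylim = c(0, 3), lwd = 2)   # a = b = 1: uniform
curve(dbeta(x, 0.5, 0.5), 0, 1, add = TRUE, lty = 2)       # a, b < 1: spikes at 0 and 1
curve(dbeta(x, 2, 3),     0, 1, add = TRUE, lty = 3)       # a, b > 1: unimodal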

The Pareto distribution is used to model the distribution of quantities that exhibit long tails, i.e., heavy tails.
1 If we plot the distribution on a log-log scale, it forms a straight line of the form log p(x) = a log x + c. This is known as a power law.
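A minimal check of the straight-line claim, using the Pareto density p(x) = a m^a / x^(a+1) with arbitrary a and m:

## On a log-log scale the Pareto density is a straight line (power law)
dpareto <- function(x, a = 2, m = 1) ifelse(x >= m, a * m^a / x^(a + 1), 0)
x <- seq(1, 100, length.out = 200)
plot(log(x), log(dpareto(x)), type = "l", lwd = 2)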

Joint Probability Distribution

Uncorrelated does not imply independent.
Note that the correlation reflects the noisiness and direction of a linear relationship, but not the slope of that relationship, nor many aspects of nonlinear relationships.

Multivariate Gaussian distribution

Multivariate Student's t distribution

Dirichlet distribution
natural generalization of the beta distribution

Transformation of random variables

Linear transformation

General transformation - Jacobian matrix J

Central limit theorem

Monte Carlo approximation

Generate S samples from the distribution (e.g., with MCMC, Markov chain Monte Carlo), then use Monte Carlo to approximate the expected value of any function of a random variable. That is, simply draw samples and then compute the mean of the function applied to the samples.
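A minimal sketch: approximating E[X^2] for X ~ N(0, 1), whose exact value is 1.

## Monte Carlo approximation of E[X^2], X ~ N(0, 1) (exact answer: 1)
set.seed(42)
S <- 100000
x <- rnorm(S)
mean(x^2)   # the sample mean of f(x) approximates the expectation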

Information Theory

Entropy

The entropy of a random variable is a measure of its uncertainty.

H(X) = - sum_1^K p(X=k) log_2 p(X=k)

The cross entropy is the average number of bits (short for binary digits; log base 2) needed to encode data coming from a source with distribution p when model q is used to encode it.
Among distributions over a finite set of values, the uniform distribution has the maximum entropy.

KL divergence

One way to measure the dissimilarity of two probability distributions, p and q, is known as the Kullback-Leibler divergence (KL divergence) or relative entropy.

KL(p(X) || q(X)) = sum_x p(X=x) log_2 (p(X=x)/q(X=x))
= sum_1^K p(X=k) log_2 p(X=k) - sum_1^K p(X=k) log_2 q(X=k)
= H(p,q) - H(p)

H(p,q) is called the cross entropy, which is the average number of bits needed to encode data coming from a source with distribution p when we use model q to define our codebook. Hence the regular entropy is the expected number of bits if we use the true model, and the KL divergence is the difference between the two. In other words, the KL divergence is the average number of extra bits needed to encode the data, due to the fact that we use distribution q to encode the data instead of the true distribution p.
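A minimal numeric check with two made-up discrete distributions p and q:

## Entropy, cross entropy, and KL divergence (in bits) for toy distributions
p <- c(0.5, 0.25, 0.25)
q <- c(0.4, 0.4, 0.2)
H   <- -sum(p * log2(p))      # entropy of p
Hpq <- -sum(p * log2(q))      # cross entropy: code built from q, data from p
KL  <- sum(p * log2(p / q))   # KL(p || q)
c(entropy = H, cross.entropy = Hpq, KL = KL, check = Hpq - H)   # KL = H(p,q) - H(p)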

Mutual Information

The mutual information of X and Y is the reduction in uncertainty about X after observing Y, or, by symmetry, the reduction in uncertainty about Y after observing X.

I(X, Y) = KL(p(X,Y) || p(X)p(Y)) = sum_x sum_y p(x,y) log( p(x,y) / (p(x)p(y)) )
= H(X) + H(Y) - H(X, Y)
= H(X, Y) - H(X|Y) - H(Y|X)

H(X) and H(Y) are the marginal entropies, H(X|Y) and H(Y|X) are the conditional entropies, and H(X, Y) is the joint entropy of X and Y.
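A minimal numeric check of the identities, using a small made-up joint distribution:

## Mutual information from a toy joint distribution p(x, y)
pxy <- matrix(c(0.3, 0.1,
                0.2, 0.4), nrow = 2, byrow = TRUE)
px <- rowSums(pxy); py <- colSums(pxy)
I   <- sum(pxy * log2(pxy / outer(px, py)))   # KL(p(x,y) || p(x)p(y))
HX  <- -sum(px * log2(px)); HY <- -sum(py * log2(py))
HXY <- -sum(pxy * log2(pxy))
c(I = I, check = HX + HY - HXY)               # the two quantities agree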

Thursday, July 5, 2012

Collinearity and VIF


Collinearity occurs when two or more variables are highly associated. Including them in a linear model can result in confusing, nonsensical, or misleading results, because the model cannot differentiate the contribution from each of them.

Because visits and transactions are so highly related, and also because a linear model assumes that effects are additive, an effect attributed to one variable (such as transactions) is not available in the model to be attributed jointly to another that is highly correlated with it (visits). This will cause the standard errors of the predictors to increase, which means that the coefficient estimates will be highly uncertain or unstable. As a practical consequence, coefficient estimates may differ dramatically from sample to sample due to minor variation in the data, even when the underlying relationships are the same.

The degree of collinearity in data can be assessed with the variance inflation factor (VIF). This estimates how much the standard error (variance) of a coefficient in the linear model is increased because of shared variance with other variables, compared to the situation where the variables are uncorrelated or a simple single-predictor regression is performed.

The VIF provides a measure of shared variance among variables in a model. A common rule of thumb is that VIF > 5.0 indicates the need to mitigate collinearity.
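A minimal sketch of checking VIF with car::vif() on simulated data (the variables and coefficients below are made up for illustration):

## Hypothetical example: two nearly collinear predictors inflate the VIF
## (requires the 'car' package)
library(car)
set.seed(1)
visits       <- rnorm(200, mean = 10, sd = 2)
transactions <- 0.8 * visits + rnorm(200, sd = 0.5)   # nearly collinear with visits
spend        <- 3 * visits + 2 * transactions + rnorm(200)
fit <- lm(spend ~ visits + transactions)
vif(fit)   # values > 5 suggest collinearity worth mitigating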

There are three general strategies for mitigating collinearity:
- Omit variables that are highly correlated.
- Eliminate the correlation by extracting principal components or factors for sets of highly correlated predictors.
- Use a method that is robust to collinearity, i.e., something other than traditional linear modeling, e.g., a random forest, which only uses a subset of the variables at a time. Or, use PCA to extract the first component from the correlated variables.
In all, common approaches to fixing collinearity include omitting highly correlated variables and using principal components or factor scores instead of individual items.







Structural Equation Modeling (SEM) in R

Structural models are helpful when your modeling needs meet any of the following conditions:
- To evaluate the interconnections among multiple data points that do not map neatly onto the division between predictors and an outcome variable
- To include unobserved latent variables, such as attitudes, and estimate their relationships to one another or to observed data
- To estimate the overall fit between observed data and a proposed model with latent variables or complex connections

Structural models are closely related both to linear modeling, because they estimate associations and model fit, and to factor analysis, because they use latent variables.

With regard to latent variables, the models can be used to estimate the association between outcomes such as purchase behavior and the underlying attitudes that influence them, such as brand perception, brand preference, likelihood to purchase, and satisfaction.


With SEM, it is feasible to do several things that improve our models: to include multiple influences, to posit unobserved concepts that underlie the observed indicators (i.e., constructs such as brand preference, likelihood to purchase, and satisfaction), to specify how those concepts influence one another, to assess the model's overall congruence to the data, and to determine whether the model fits the data better than alternative models.

SEM creates a graphical path diagram of influences and then estimates the strength of the relationship for each path in the model. Such paths often concern two kinds of variables: manifest variables, which are observed (i.e., have data points), and latent variables, which are conceived to underlie the observed data. The set of relationships among the latent variables is called the structural model, while the linkage between those elements and the observed, manifest variables is the measurement model.

Structural equation models are similar to linear regression models, but differ in three regards.
First, they assess the relationships among many variables, with models that may be more complex than simply predictors and outcomes.
Second, those relationships allow for latent variables that represent underlying constructs that are thought to be manifested imperfectly in the observed data.
Third, the models allow relationships to have multiple 'downstream' effects.

Two general approaches to SEM are the covariance-based approach (CB-SEM), which attempts to model the relationships among the variables at once and thus is a strong test of the model, and the partial least squares approach (PLS-SEM), which fits parts of the data sequentially and has less stringent requirements. 

After specifying a CB-SEM model, simulate a data set using simulateData() from lavaan with reasonable guesses as to the variable loadings. Use the simulated data to determine whether your model is likely to converge for the sample size you expect.

Plot your specified model graphically and inspect it carefully to check that it is the model you intended to estimate. 

Whenever possible, specify one or two alternative models and check those in addition to your model. Before accepting a CB-SEM model, use compareFit() to demonstrate that your model fits the data better than the alternatives.
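A minimal sketch of that workflow with lavaan and semTools (compareFit() comes from semTools; the model strings, variable names, and loadings below are hypothetical):

## Hypothetical CB-SEM workflow: simulate data, fit, and compare models
library(lavaan)
library(semTools)

# Population model with guessed loadings, used only to simulate data
popModel <- "
  Quality =~ 0.8*q1 + 0.7*q2 + 0.6*q3
  Intent  =~ 0.8*i1 + 0.7*i2 + 0.6*i3
  Intent  ~  0.5*Quality
"
simDat <- simulateData(popModel, sample.nobs = 300)   # check convergence at the planned N

# Proposed two-factor model and a simpler one-factor alternative
model1 <- "
  Quality =~ q1 + q2 + q3
  Intent  =~ i1 + i2 + i3
  Intent  ~  Quality
"
model2 <- "
  OneFactor =~ q1 + q2 + q3 + i1 + i2 + i3
"
fit1 <- sem(model1, data = simDat)
fit2 <- sem(model2, data = simDat)
summary(compareFit(fit1, fit2))   # does the proposed model fit better than the alternative?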


If you have data of varying quality, nominal categories, a small sample, or problems converging a CB-SEM model, consider partial least squares SEM (PLS-SEM).


##################################################################
## SEM
##################################################################
## install.packages("sem")
require(sem)

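## Read the lower-triangular correlation matrix of the ten observed variables
## (diagonal omitted); readMoments() returns the full symmetric matrix.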
R.DHP <- readMoments(diag=FALSE, names=c("ROccAsp", "REdAsp", "FOccAsp",
                                         "FEdAsp", "RParAsp", "RIQ", "RSES", "FSES", "FIQ", "FParAsp"),
                     text="
                     .6247
                     .3269 .3669
                     .4216 .3275 .6404
                     .2137 .2742 .1124 .0839
                     .4105 .4043 .2903 .2598 .1839
                     .3240 .4047 .3054 .2786 .0489 .2220
                     .2930 .2407 .4105 .3607 .0186 .1861 .2707
                     .2995 .2863 .5191 .5007 .0782 .3355 .2302 .2950
                     .0760 .0702 .2784 .1988 .1147 .1021 .0931 -.0438 .2087
                     ")
model.dhp.1 <- specifyEquations(covs="RGenAsp, FGenAsp", text="
                                RGenAsp = gam11*RParAsp + gam12*RIQ + gam13*RSES + gam14*FSES + beta12*FGenAsp
                                FGenAsp = gam23*RSES + gam24*FSES + gam25*FIQ + gam26*FParAsp + beta21*RGenAsp
                                ROccAsp = 1*RGenAsp
                                REdAsp = lam21(1)*RGenAsp # to illustrate setting start values
                                FOccAsp = 1*FGenAsp
                                FEdAsp = lam42(1)*FGenAsp
                                ")
sem.dhp.1 <- sem(model.dhp.1, R.DHP, 329,
                 fixed.x=c('RParAsp', 'RIQ', 'RSES', 'FSES', 'FIQ', 'FParAsp'))
summary(sem.dhp.1)