Wednesday, November 9, 2011

PROC CLUSTER, PROC FASTCLUS to run Variable Selection for Clustering Analysis

Variable Standardization
Standardization refers to transforming variables by adjusting their location (mean or median) and/or their scale (standard deviation, range, and so on), depending on the data and the context of the analysis. SAS PROC STDIZE allows users to standardize a set of variables using several criteria, including the following methods:

RANGE standardization is most helpful when variables are measured on different scales. Variables with large values and a wide range of variation would otherwise dominate the final similarity measure, so it is essential to ensure that each variable contributes evenly to the distance computation by means of data standardization.

MEAN/MEDIAN centering refers to adjusting variables using only the mean or median. The variable's mean or median is subtracted from each original score. The standard deviation is not adjusted in this process.

STD standardization refers to adjusting variables using the variable mean and standard deviation. The variable mean is subtracted from each original score, and the result is then divided by the standard deviation.

L(p) refers to scaling the scores using the Minkowski L(p) metric. For example, if p=2, the Euclidean (L2) norm is applied.

proc stdize data=dataset method=range out=outdataset outstat=stats;
var &varlist;
run;
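
Switching methods is a one-option change; for instance, METHOD=STD or the Minkowski METHOD=L(p) can be requested as in the following minimal sketch (the output data set names are illustrative):

* Center to zero mean and scale to unit standard deviation;
proc stdize data=dataset method=std out=outdataset_std;
var &varlist;
run;

* Scale each variable by its Minkowski L(2) norm;
proc stdize data=dataset method=l(2) out=outdataset_l2;
var &varlist;
run;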

Standardization to unit variability can be seen as a special case of weighting, where the weights are the reciprocals of variability. In some cases, however, defining weights as inversely proportional to some measure of total variability can actually dilute the differences between clusters. This is a major reason why weights based on the sample range are usually more effective for clustering than weights based on the standard deviation.

Clustering Methodology

Types of Models

SAS offers several clustering algorithms, including k-means clustering, nonparametric clustering, and hierarchical clustering.

1. k-means clustering is perhaps the most popular partitive clustering algorithm. One reason for its popularity is that the time required to reach convergence is proportional to the number of observations being clustered, which means it can be used to cluster large data sets. However, k-means clustering is inappropriate for small data sets (<100 cases), where the solution becomes sensitive to the order of the observations and the variables; this is known as the order effect.

title 'K-Means Clustering using Adaptive Training';
proc fastclus data=dataset maxclusters=5 maxiter=100 least=2 drift replace=full distance out=cluster;
var &varlist;
run;

Option MAXCLUSTERS= specifies the maximum number of clusters allowed. Instead of MAXCLUSTERS=, the RADIUS= option can be used to establish the minimum distance criterion for selecting new seeds: no observation is considered as a new seed unless its minimum distance to previous seeds exceeds the value given by the RADIUS= option. The default value is 0.
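
For example, dropping MAXCLUSTERS= and letting the RADIUS= criterion drive the number of seeds might look like the following minimal sketch (the radius value and output data set name are illustrative):

proc fastclus data=dataset radius=2.0 maxiter=100 out=cluster_r;
var &varlist;
run;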

Option MAXITER= specifies the maximum number of iterations for recomputing cluster seeds. By default, MAXITER=1. When the value of the MAXITER= option is greater than 0, each observation is assigned to the nearest seed, and the seeds are recomputed as the means of the clusters.

Option LEAST= specifies the distance metric. By default, LEAST=2, which is the Euclidean distance. LEAST=1 implements the city block distance, and LEAST=MAX implements the maximum absolute deviation. Note that PROC FASTCLUS requires numeric data and is sensitive to extreme values, so standardization during data preparation is important.
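
A common pattern, therefore, is to standardize first and then cluster, optionally with the more outlier-resistant city block metric. A minimal sketch, assuming the same &varlist as above (the intermediate and output data set names are illustrative):

* Range-standardize so no single variable dominates the distance;
proc stdize data=dataset method=range out=dataset_std;
var &varlist;
run;

* Cluster the standardized data with the city block (L1) metric;
proc fastclus data=dataset_std maxclusters=5 maxiter=100 least=1 out=cluster_l1;
var &varlist;
run;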

When the DRIFT option is specified, the closest seed drifts as each case is assigned to it: the seed is recomputed as the current mean of its cluster.

Option REPLACE= specifies how seed replacement is performed. Option REPLACE=FULL requests default seed replacement. Option REPLACE=PART requests seed replacement only when the distance between the observation and the closest seed is greater than the minimum distance between seeds. Option REPLACE=NONE suppresses seed replacement. Option REPLACE=RANDOM selects a simple pseudo-random sample of complete observations as initial cluster seeds.
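
REPLACE=RANDOM is often paired with the RANDOM= option, which sets the seed of the pseudo-random number generator so that a run can be reproduced. A minimal sketch (the seed value and output data set name are illustrative):

proc fastclus data=dataset maxclusters=5 maxiter=100 replace=random random=12345 out=cluster_rnd;
var &varlist;
run;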

Option DISTANCE displays the distances between the cluster means.

2. Nonparametric methods can detect clusters of unequal size and dispersion, or clusters with irregular shapes. If the covariance matrices are unequal or the distributions are radically non-normal, nonparametric density estimation is often the best approach. Nonparametric methods are also less sensitive than most clustering techniques to changes in scale.

title 'Nonparametric Clustering';
proc modeclus data=dataset method=1 r=5.75330 join out=nonpar_results;
var &varlist;
run;


Option R= specifies the radius of the sphere of support for uniform-kernel density estimation and the neighborhood for clustering. Option METHOD= specifies which clustering method to use. For most purposes, METHOD=1 is recommended.
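
Because the solution can be sensitive to the smoothing parameter, the R= option accepts a list of values, and PROC MODECLUS performs a separate analysis for each. A minimal sketch (the radii and output data set name are illustrative):

proc modeclus data=dataset method=1 r=3 4 5 6 out=nonpar_multi;
var &varlist;
run;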

3. Hierarchical methods form the backbone of cluster analysis. Their popularity is partly due to the fact that they are not subject to the order effect, and some hierarchical methods can even recover irregular clusters directly. Unfortunately, hierarchical methods often require processing time on the order of the square of the number of observations, which limits their use to small and mid-sized data sets.

title2 'Hierarchical Solution (Average)';
proc cluster data=dataset method=average outtree=tree simple rmsstd rsquare;
var &varlist;
copy customer_key;
run;

Option SIMPLE displays simple descriptive statistics.

The METHOD= specification determines the clustering method used by the procedure. The example uses the average linkage method. Other popular methods include single, centroid, twostage, and ward.

Option RMSSTD displays the pooled standard deviation of all the variables of each cluster. Since the objective of cluster analysis is to form homogeneous groups, the RMSSTD of a cluster should be as small as possible.

Option RSQUARE displays the R-squared and semi-partial R-squared to evaluate the cluster solution. R-squared measures the extent to which groups or clusters differ from each other (so, with just one cluster, the R-squared value is, intuitively, zero); thus, the R-squared value should be high. Semi-partial R-squared is the loss of homogeneity due to combining two groups or clusters to form a new group or cluster; thus, the semi-partial R-squared value should be small, implying that two homogeneous groups are being merged.
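
To help decide where to cut the tree, PROC CLUSTER can also display the cubic clustering criterion and the pseudo F and t-squared statistics through the CCC and PSEUDO options; local peaks in these statistics suggest candidate numbers of clusters. A minimal sketch building on the call above (the output tree name is illustrative):

proc cluster data=dataset method=average ccc pseudo outtree=tree2;
var &varlist;
run;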

proc tree data=tree nclusters=5 dock=5 out=results2 noprint;
copy &varlist;
run;

The TREE procedure produces a tree diagram, also known as a dendrogram or phenogram, from a data set created by PROC CLUSTER. PROC CLUSTER creates output data sets that contain the results of hierarchical clustering as a tree structure, and PROC TREE uses that output data set to draw the diagram.

Option NCLUSTERS= specifies the number of clusters desired in the OUT= data set. Option DOCK=n gives observations in small clusters (those with n or fewer members) missing values for the cluster variables in the OUT= data set.

The COPY statement specifies one or more character or numeric variables to be copied to the OUT= data set.
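
Because the OUT= data set from PROC TREE contains a CLUSTER variable along with the copied variables, the solution can be profiled directly. A minimal sketch using the results2 data set created above:

* Compare variable means and spreads across the five clusters;
proc means data=results2 mean std;
class cluster;
var &varlist;
run;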
