Supervised hierarchical clustering using CART

The size and complexity of current data mining datasets have eclipsed the limits of traditional statistical techniques. Such large datasets frequently require some form of cluster analysis, usually a hierarchical cluster analysis. However, applying a traditional hierarchical scheme to large datasets requires an additional cluster validation analysis. Classification and Regression Trees (CART) is a non-parametric regression and classification technique that has become popular within the biotechnology and ecological fields. CART's intuitive interpretation and ability to handle large datasets make it easily accessible to the non-statistician, because it presents the statistical relationships it finds in the form of a binary tree. This paper proposes a supervised clustering algorithm capable of finding real clusters within large datasets by using CART to filter the clusters found by any hierarchical technique. The supervision performed by CART filters the results of a hierarchical cluster analysis by merging or removing poorly defined groups. It is common practice to validate a cluster analysis using discriminant analysis; however, this assumes that the correct number of clusters is known. CART implements a selective classification of groups, allowing some groups not to be explicitly classified, a feature not supported by standard discriminant analysis. This selective classification serves two purposes: first, it filters or merges clusters that are not validated by the data; second, it acts as a relationship model for the clusters found and provides statistical measures of certainty for the analysis. An example of this method is presented using Sea Surface Temperatures (SST). This is an ideal choice, as very little statistical cluster analysis has been carried out on this dataset, yet knowledge of such structure is in high demand.
The analysis is performed for a single month, November, for the years 1940 through 2002, where some of the most useful variation is expected. The supervised clustering technique successfully extracted seven meaningful clusters, which were predicted with a cross-validated classification rate of 0.50.
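
As a rough illustration of the idea summarised above (hierarchical clustering followed by a CART-based filter), a minimal Python sketch is given here. It is not the algorithm of the paper: it assumes SciPy's Ward linkage as the hierarchical technique and scikit-learn's `DecisionTreeClassifier` as a stand-in for CART, it only removes poorly predicted clusters and does not merge them, and the function name `supervised_clusters` and the parameters `n_initial` and `min_recall` are hypothetical names chosen for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict


def supervised_clusters(X, n_initial=10, min_recall=0.5, seed=0):
    """Hierarchical clustering filtered by a cross-validated decision tree.

    Illustrative assumptions: Ward linkage, 5-fold cross-validation, and a
    per-cluster recall threshold (min_recall) as the filtering rule.
    """
    # Step 1: unsupervised hierarchical clustering, cut into at most
    # n_initial groups (fcluster labels start at 1).
    labels = fcluster(linkage(X, method="ward"), t=n_initial, criterion="maxclust")

    # Step 2: a classification tree tries to predict the cluster labels
    # from the data; predictions come from held-out folds.
    tree = DecisionTreeClassifier(min_samples_leaf=5, random_state=seed)
    predicted = cross_val_predict(tree, X, labels, cv=5)

    # Step 3: keep only clusters the tree can recover; the rest are left
    # unclassified (label 0) rather than forced into a group.
    kept = [
        int(c)
        for c in np.unique(labels)
        if np.mean(predicted[labels == c] == c) >= min_recall
    ]
    filtered = np.where(np.isin(labels, kept), labels, 0)

    # Overall cross-validated classification rate on the initial labels.
    rate = float(np.mean(predicted == labels))
    return filtered, kept, rate
```

Calling `supervised_clusters(X, n_initial=3)` on an array `X` of shape `(n_samples, n_features)` returns the filtered labels, the list of retained cluster ids, and the cross-validated classification rate. The recall threshold is one simple filtering rule among many; a merge step for clusters that the tree confuses with one another would be a natural extension.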