Mixture Model Tests Of Hierarchical Clustering Algorithms: The Problem Of Classifying Everybody.

Due to the effects of outliers, mixture model tests that require all objects to be classified can severely underestimate the accuracy of hierarchical clustering algorithms. More valid and relevant comparisons between algorithms can be made by calculating accuracy at several levels in the hierarchical tree and considering accuracy as a function of the coverage of the classification. Using this procedure, several algorithms were compared on their ability to resolve ten multivariate normal mixtures. All of the algorithms were significantly more accurate than a random linkage algorithm, and accuracy was inversely related to coverage. Algorithms using correlation as the similarity measure were significantly more accurate than those using Euclidean distance (p < .001). A subset of high accuracy algorithms, including single, average, and centroid linkage using correlation, and Ward's minimum variance technique, was identified.