Clustering and classification methods

The chapter by Milligan and Hirtle provides an overview of the current state of knowledge in the field of clustering and classification. Such methods are used to find groups in multivariate data sets. The methods are discussed within the context of exploratory data analysis, though some confirmatory or testing methods are reviewed. A survey of the issues critical to the analysis of empirical data is presented along with “best practice” recommendations for the applied user. Coverage includes sections on data preparation, data models, and data representation using distance and similarity measures. The section on clustering algorithms covers a wide range of classification methods. In addition, the algorithms section includes a discussion of the known cluster recovery performance of various selected clustering methods. The fourth section covers a variety of issues important for applied analyses such as data sampling, variable selection, variable standardization, choosing the number of clusters, and postclassification analysis of the results. Threaded into the discussion are three example applications of the methodology to empirical data. The examples are based on perceived kinship data, animal similarity data, and the classification of single malt scotch whiskies.

Keywords: classification validation; cluster analysis; clustering algorithms; Monte Carlo methods; similarity measures; tree models of data
