Finding the Number of Clusters in a Dataset

One of the most difficult problems in cluster analysis is identifying the number of groups in a dataset. Most previously suggested approaches to this problem are either somewhat ad hoc or require parametric assumptions and complicated calculations. In this article we develop a simple yet powerful nonparametric method for choosing the number of clusters based on distortion, a quantity that measures the average distance, per dimension, between each observation and its closest cluster center. Our technique is computationally efficient and straightforward to implement. We demonstrate its effectiveness empirically, not only for choosing the number of clusters but also for identifying underlying structure, on a wide range of simulated and real-world datasets. In addition, we give a rigorous theoretical justification for the method based on information-theoretic ideas. Specifically, results from rate distortion theory, a subfield of electrical engineering, allow us to describe the behavior of the distortion both in the presence and in the absence of clustering. Finally, we note that these ideas can potentially be extended to a wide range of other statistical model selection problems.
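
To make the notion of distortion concrete, the following is a minimal sketch of how it might be computed for a range of candidate cluster counts. It is not the article's implementation: it assumes squared Euclidean distance as the distance measure and a plain Lloyd's k-means to obtain cluster centers, and the function names `kmeans` and `distortion`, the iteration cap, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm (an assumed stand-in for any clustering routine).

    Returns an array of k cluster centers for the (n, p) data matrix X.
    """
    rng = np.random.default_rng(seed)
    # Initialise centers at k distinct observations chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center: (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points; keep it if it has none.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers


def distortion(X, centers):
    """Average squared distance, per dimension, to the closest center."""
    n, p = X.shape
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean() / p


if __name__ == "__main__":
    # Hypothetical data: three well-separated Gaussian groups in two dimensions.
    rng = np.random.default_rng(1)
    means = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
    X = np.vstack([m + 0.5 * rng.standard_normal((50, 2)) for m in means])
    for k in range(1, 7):
        print(k, distortion(X, kmeans(X, k)))
```

Plotting or tabulating the distortion against the number of clusters gives the curve whose behavior the method examines. How that curve is turned into a choice of the number of clusters is described in the article itself and is not reproduced here.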
