On the number of groups in clustering

Clustering is the problem of partitioning data into a finite number k of homogeneous and separate groups, called clusters. A good choice of k is essential for building meaningful clusters. In this paper, we address the choice of k from the point of view of model selection via penalization. We design an appropriate penalty shape and derive an associated oracle-type inequality. The method is illustrated on both simulated and real-life data sets.
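
To make the idea of choosing k by penalization concrete, here is a minimal Python sketch. It is not the paper's procedure: the abstract does not state the penalty, so the shape a * sqrt(k / n), the constant a, the farthest-point initialization, and all function names below are assumptions made for illustration. The sketch runs a plain Lloyd-type k-means for each candidate k, computes the empirical distortion (mean squared distance to the nearest center), and returns the k that minimizes distortion plus penalty.

```python
import math
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_init(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans_distortion(points, k, n_iter=50):
    """Run Lloyd iterations and return the empirical distortion:
    the mean squared distance from each point to its nearest center."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[j].append(p)
        # An empty group keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points) / len(points)


def select_k(points, k_max, a=1.0):
    """Return the k in 1..k_max minimizing distortion + a * sqrt(k / n).
    The penalty shape and the constant `a` are hypothetical choices."""
    n = len(points)
    crit = {
        k: kmeans_distortion(points, k) + a * math.sqrt(k / n)
        for k in range(1, k_max + 1)
    }
    return min(crit, key=crit.get), crit


# Toy data: three well-separated groups of 20 points each.
rng = random.Random(0)
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [
    (cx + rng.uniform(-0.1, 0.1), cy + rng.uniform(-0.1, 0.1))
    for cx, cy in blobs
    for _ in range(20)
]
k_hat, crit = select_k(points, k_max=6)
print(k_hat)
```

On this toy data the distortion drops sharply up to three centers and barely changes afterwards, so the penalty term decides among larger k. In practice the constant a would need to be calibrated from the data rather than fixed at 1.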
