Estimation of the Number of Clusters Using Multiple Clustering Validity Indices

One of the challenges in unsupervised machine learning is determining the number of clusters in a dataset. Clustering Validity Indices (CVIs) are popular tools for addressing this problem. Many CVIs have been proposed, and comparative studies suggest that no single CVI consistently outperforms the others. Following suggestions found in prior work, in this paper we formalize the use of multiple CVIs for cluster number estimation within the framework of multi-classifier fusion. Using a large number of datasets, we show that decision-level fusion of multiple CVIs can lead to significant gains in accuracy when estimating the number of clusters, in particular for high-dimensional datasets with a large number of clusters.
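As a rough illustration of what decision-level fusion can look like, the sketch below lets each index pick its preferred number of clusters and then combines those picks by plurality vote. The abstract does not state which fusion rules or indices the paper uses, so the voting rule, the tie-break, the function names (`best_k`, `fuse_by_plurality`), and the score values are all assumptions made for illustration only.

```python
from collections import Counter
from statistics import median


def best_k(scores, higher_is_better=True):
    """Return the candidate k preferred by one validity index.

    `scores` maps each candidate k to that index's value at k.
    """
    pick = max if higher_is_better else min
    return pick(scores, key=scores.get)


def fuse_by_plurality(estimates):
    """Fuse per-index estimates of k by plurality vote.

    Assumed tie-break (not from the paper): among the tied values, take
    the one closest to the median of all estimates, then the smaller k.
    """
    counts = Counter(estimates)
    top = max(counts.values())
    tied = [k for k, c in counts.items() if c == top]
    centre = median(estimates)
    return min(tied, key=lambda k: (abs(k - centre), k))


# Hypothetical index values for k = 2, 3, 4 (made-up numbers).
# The boolean says whether a larger value of the index is better.
index_scores = {
    "silhouette": ({2: 0.41, 3: 0.58, 4: 0.52}, True),
    "calinski_harabasz": ({2: 120.0, 3: 310.5, 4: 295.2}, True),
    "davies_bouldin": ({2: 0.95, 3: 0.80, 4: 0.62}, False),
}

# Step 1: each index makes its own decision about k.
estimates = [best_k(s, hib) for s, hib in index_scores.values()]

# Step 2: the decisions are fused into a single estimate.
fused_k = fuse_by_plurality(estimates)
print(estimates, fused_k)
```

In this made-up case two indices prefer k = 3 and one prefers k = 4, so the vote settles on 3. Other decision-level rules, such as taking the mean or median of the estimates, would slot into the same second step.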