A Validity Index for Prototype-Based Clustering of Data Sets With Complex Cluster Structures

Evaluating how well the extracted clusters fit the true partitions of a data set is one of the fundamental challenges in unsupervised clustering, because the data structure and the number of clusters are unknown a priori. Cluster validity indices are commonly used to select the best partitioning from different clustering results; however, they are often inadequate unless the clusters are well separated or have parametric shapes. Prototype-based clustering (finding clusters by grouping the prototypes obtained by vector quantization of the data), which is becoming increasingly important for its effectiveness in the analysis of large high-dimensional data sets, adds another dimension to this challenge. For validity assessment of prototype-based clusterings, previously proposed indices (mostly devised for the evaluation of point-based clusterings) usually perform poorly, and their performance degrades further when they are applied to large data sets with complicated cluster structure. In this paper, we propose a new index, Conn_Index, which can be applied to data sets with a wide variety of clusters of different shapes, sizes, densities, or overlaps. We construct Conn_Index based on inter- and intra-cluster connectivities of prototypes. Connectivities are defined through a “connectivity matrix”, which is a weighted Delaunay graph where the weights indicate the local data distribution. Experiments on synthetic and real data indicate that Conn_Index outperforms the existing validity indices considered in this paper for the evaluation of prototype-based clustering results.
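
As a rough illustration of the connectivity matrix described above, the Python sketch below assumes that the weight between two prototypes is the number of data vectors for which those two prototypes are the closest and second-closest (in either order). This weighting is an assumption made for illustration, since the abstract states only that the weights indicate the local data distribution. The function names `connectivity_matrix` and `intra_inter_connectivity` are hypothetical, and the intra/inter totals are a simple summary, not the Conn_Index formula itself.

```python
import numpy as np

def connectivity_matrix(data, prototypes):
    """Symmetric matrix of connectivity weights between prototypes.

    Assumption: entry (i, j) counts the data vectors whose two nearest
    prototypes are i and j, in either order.
    """
    # Squared Euclidean distance from every data vector to every prototype.
    dist = ((data[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    order = np.argsort(dist, axis=1)
    best, second = order[:, 0], order[:, 1]
    n = prototypes.shape[0]
    counts = np.zeros((n, n), dtype=int)
    # Unbuffered add so repeated (best, second) pairs accumulate.
    np.add.at(counts, (best, second), 1)
    return counts + counts.T

def intra_inter_connectivity(conn, labels):
    """Total connectivity within clusters and between clusters."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    # Each undirected edge appears twice in the symmetric matrix.
    intra = conn[same].sum() / 2
    inter = conn[~same].sum() / 2
    return intra, inter

# Toy example: four prototypes on a line, grouped into two clusters.
prototypes = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
data = np.array([[0.1, 0.0], [0.9, 0.0], [0.4, 0.0],
                 [10.2, 0.0], [10.8, 0.0], [5.4, 0.0]])
conn = connectivity_matrix(data, prototypes)
intra, inter = intra_inter_connectivity(conn, labels=[0, 0, 1, 1])
print(conn)
print(intra, inter)
```

In this toy case, most of the connectivity falls within the two prototype groups and only one data vector links them, which is the kind of contrast between intra- and inter-cluster connectivity that a connectivity-based index can exploit.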
