Clustering validity assessment using multi representatives

Clustering is a mostly unsupervised procedure and the majority of the clustering algorithms depend on certain assumptions in order to define the subgroups present in a data set. Moreover, they may behave in a different way depending on the features of the data set and their input parameters values. Therefore, in most applications the resulting clustering scheme requires some sort of evaluation as regards its validity. In this paper a new validity index, CDbw, is defined based on well-known clustering criteria enabling: i. the finding of the optimal input parameters’ values for a clustering algorithm that results in the optimal partitioning of a data set, ii. the selection of the algorithm that provides the optimal partitioning of a data set. CDbw puts emphasis on the geometric features of clusters, handling efficiently arbitrary shaped clusters. It achieves this by representing each cluster by a certain fixed number of clusters rather than a single center point. Our experimental results confirm the reliability of our index showing that it performs favorably in all cases selecting independently of clustering algorithm the scheme that best fits the data under consideration.

[1]  B. Ripley,et al.  Pattern Recognition , 1968, Nature.

[2]  J. Dunn Well-Separated Clusters and Optimal Fuzzy Partitions , 1974 .

[3]  G. W. Milligan,et al.  The Effect of Cluster Size, Dimensionality, and the Number of Clusters on Recovery of True Cluster Structure , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Isak Gath,et al.  Unsupervised Optimal Fuzzy Clustering , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Gerardo Beni,et al.  A Validity Measure for Fuzzy Clustering , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  André Hardy,et al.  An examination of procedures for determining the number of clusters in a data set , 1994 .

[7]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[8]  Ramasamy Uthurusamy,et al.  Data mining and knowledge discovery in databases , 1996, CACM.

[9]  Rajesh N. Davé,et al.  Validating fuzzy partitions obtained through c-shells clustering , 1996, Pattern Recognit. Lett..

[10]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[11]  Padhraic Smyth,et al.  Clustering Using Monte Carlo Cross-Validation , 1996, KDD.

[12]  Joshua Zhexue Huang,et al.  A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining , 1997, DMKD.

[13]  Michael J. A. Berry,et al.  Data mining techniques - for marketing, sales, and customer support , 1997, Wiley computer publishing.

[14]  Hans-Peter Kriegel,et al.  Incremental Clustering for Mining in a Data Warehousing Environment , 1998, VLDB.

[15]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[16]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[17]  Boudewijn P. F. Lelieveldt,et al.  A new cluster validity index for the fuzzy c-mean , 1998, Pattern Recognit. Lett..

[18]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[19]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[20]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[21]  Michalis Vazirgiannis,et al.  Quality Scheme Assessment in the Clustering Process , 2000, PKDD.

[22]  Michalis Vazirgiannis,et al.  Clustering validity assessment: finding the optimal partitioning of a data set , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[23]  Gregory Piatetsky-Shapiro,et al.  Advances in Knowledge Discovery and Data Mining , 2004, Lecture Notes in Computer Science.