An Efficient Approach for Computing Silhouette Coefficients

One popular approach for finding the best number of clusters (K) in a data set is to compute silhouette coefficients. The silhouette coefficient is first computed for each candidate value of K, and the K that yields the maximum coefficient is chosen. However, computing the silhouette coefficient for many values of K is very time-consuming, because most of the CPU time is spent on distance calculations. An approach for computing the silhouette coefficient quickly is presented. The approach is based on decreasing the number of addition operations needed when computing distances. When applied to different data sets, it saved more than 50% of the CPU time.
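To make the selection procedure concrete, the following is a minimal sketch of the standard (unaccelerated) silhouette computation and of picking K by its maximum. It is not the fast approach the abstract describes; it only illustrates why the cost is dominated by pairwise distance calculations. The function names `euclidean`, `silhouette_coefficient`, and `best_k` are hypothetical, Euclidean distance is an assumption, and the cluster labelings for each K are assumed to come from some clustering algorithm that is not shown here.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_coefficient(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points.

    a: mean distance from point i to the other members of its own cluster.
    b: smallest mean distance from point i to the members of another cluster.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singleton cluster: s(i) is taken as 0
        a = sum(euclidean(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:  # guard against all-identical points
            total += (b - a) / denom
    return total / len(points)


def best_k(points, labelings):
    """Return the K whose labeling gives the largest silhouette coefficient.

    labelings: dict mapping K -> list of cluster labels (hypothetical input).
    """
    return max(labelings, key=lambda k: silhouette_coefficient(points, labelings[k]))
```

Each call to `silhouette_coefficient` evaluates a distance for every ordered pair of points, so repeating it for every candidate K multiplies an already quadratic cost. That repeated distance work is the expense the abstract aims to reduce.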
