Adapting the right measures for K-means clustering

Clustering validation is a long-standing challenge in the clustering literature. While many validation measures have been developed for evaluating the performance of clustering algorithms, these measures often provide inconsistent information about clustering performance, and the most suitable measures to use in practice remain unknown. This paper addresses this gap with an organized study of 16 external validation measures for K-means clustering. Specifically, we first show the importance of measure normalization when evaluating clustering performance on data with imbalanced class distributions, and we provide normalization solutions for several measures. In addition, we summarize the major properties of these external measures. These properties can serve as guidance for selecting validation measures in different application scenarios. Finally, we reveal the interrelationships among these external measures. By mathematical transformation, we show that some validation measures are equivalent, and that some other measures have consistent validation performance. Most importantly, we provide a guideline for selecting the most suitable validation measures for K-means clustering.
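To make the normalization point concrete, the following is a minimal sketch, not taken from the paper, that compares an unnormalized external measure (purity) with a chance-corrected one (the adjusted Rand index) on a toy data set with imbalanced classes. The function names, the toy labels, and the choice of these two measures are assumptions made for illustration; they are not necessarily the measures the study recommends.

```python
from collections import Counter
from math import comb  # requires Python 3.8+


def contingency(classes, clusters):
    """Count the points in each (cluster, class) cell."""
    return Counter(zip(clusters, classes))


def purity(classes, clusters):
    """Fraction of points that belong to the majority class of their cluster."""
    best = Counter()
    for (cluster, _cls), count in contingency(classes, clusters).items():
        best[cluster] = max(best[cluster], count)
    return sum(best.values()) / len(classes)


def adjusted_rand_index(classes, clusters):
    """Rand index corrected for chance agreement.

    Undefined (division by zero) in the degenerate case where both
    partitions consist of a single group; not handled in this sketch.
    """
    n = len(classes)
    sum_cells = sum(comb(c, 2) for c in contingency(classes, clusters).values())
    sum_clusters = sum(comb(c, 2) for c in Counter(clusters).values())
    sum_classes = sum(comb(c, 2) for c in Counter(classes).values())
    expected = sum_clusters * sum_classes / comb(n, 2)
    max_index = (sum_clusters + sum_classes) / 2
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: 90 points of class "A", 10 of class "B",
# split evenly across two clusters regardless of class.
classes = ["A"] * 90 + ["B"] * 10
clusters = [0] * 45 + [1] * 45 + [0] * 5 + [1] * 5

print("purity:", purity(classes, clusters))
print("adjusted Rand index:", adjusted_rand_index(classes, clusters))
```

In this toy case the clustering carries no information about the classes, yet purity is 0.9 because the large class dominates both clusters, while the adjusted Rand index is close to zero. This is the kind of distortion on imbalanced data that normalization is meant to remove.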
