Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance

Information-theoretic measures form a fundamental class of measures for comparing clusterings, and have recently received increasing interest. Nevertheless, a number of questions concerning their properties and inter-relationships remain unresolved. In this paper, we perform an organized study of information-theoretic measures for clustering comparison, including several popular measures from the literature as well as some newly proposed ones. We discuss and prove their important properties, such as the metric property and the normalization property. We then highlight to the clustering community the importance of correcting information-theoretic measures for chance, especially when the data size is small compared to the number of clusters present. Of the available information-theoretic measures, we advocate the normalized information distance (NID) as a general measure of choice, since it concurrently possesses several important properties: it is both a metric and a normalized measure, it admits an exact analytical adjusted-for-chance form, and it uses the nominal [0,1] range better than other normalized variants.
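To make the advocated measure concrete, the sketch below computes the NID between two clusterings as 1 - I(U;V)/max(H(U), H(V)), the normalization by maximum entropy discussed in the abstract. This is a minimal illustration, not the paper's reference implementation; the function names are ours, and the chance correction (the adjusted form) is deliberately omitted for brevity.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy H(U) of a clustering, in nats
    n = len(labels)
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / n
    return -np.sum(p * np.log(p))

def mutual_information(u, v):
    # I(U;V) computed from the contingency counts of the two clusterings
    n = len(u)
    joint = Counter(zip(u, v))
    pu, pv = Counter(u), Counter(v)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log( p(a,b) / (p(a) p(b)) )
        mi += (c / n) * np.log((c * n) / (pu[a] * pv[b]))
    return mi

def nid(u, v):
    # Normalized information distance: 1 - I(U;V) / max(H(U), H(V)).
    # 0 for identical clusterings (up to label permutation), 1 for
    # independent ones; a metric on the space of clusterings.
    h = max(entropy(u), entropy(v))
    return 1.0 - mutual_information(u, v) / h if h > 0 else 0.0
```

Note that NID is invariant to relabeling of the clusters, since it depends only on the contingency counts; the unadjusted form shown here can still be positively biased for small samples, which is the motivation for the chance correction the paper develops.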
