Research Report: An Information-Theoretic External Cluster-Validity Measure

This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests. After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties).

ABSTRACT: In this paper we propose a measure of similarity/association between two partitions of a set of objects. Our motivation is to use the measure to characterize the quality or accuracy of clustering algorithms by comparing the clusters they produce with "ground truth" consisting of classes assigned to the patterns by manual means or by some other means in whose veracity there is confidence. Such measures are referred to as "external". Our measure also allows clusterings with different numbers of clusters to be compared in a quantitative and principled way. Our evaluation scheme quantitatively measures how useful the cluster labels of the patterns are as predictors of their class labels. When all clusterings to be compared have the same number of clusters, the measure is equivalent to the mutual information between the cluster labels and the class labels. In cases where the numbers of clusters are different, however, it computes the reduction in the number of bits that would be required to encode (compress) the class labels if both the encoder and decoder have free access to the cluster labels. To achieve this encoding, the estimated conditional probabilities of the class labels given the cluster labels must also be encoded. These estimated probabilities can be seen as a "model" for the class labels and their associated code length as a "model cost".
In addition to defining the measure, we compare it to other commonly used external measures and demonstrate its superiority as judged by certain criteria.

The most common unsupervised-learning problem is clustering, in which we are given a set of objects or patterns $\Omega = \{\omega_i \mid i = 1, 2, \ldots, n\}$, and each object has a representation $x_i \equiv x(\omega_i)$ in some feature space, which is frequently treated as an $m$-dimensional continuum $\mathbb{R}^m$. Some of the features may be categorical, however. The goal in clustering is to group …
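The mutual-information part of the measure described in the abstract can be illustrated with a minimal sketch. It estimates the entropy of the class labels, their conditional entropy given the cluster labels, and the difference between the two (the empirical mutual information, in bits). The function names and the toy labels are hypothetical and chosen only for illustration. The "model cost" term is deliberately left out, because the abstract does not state its exact form.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Empirical entropy H(labels) in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def conditional_entropy(classes, clusters):
    """Empirical conditional entropy H(class | cluster) in bits."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))   # counts of (cluster, class) pairs
    cluster_sizes = Counter(clusters)         # counts per cluster
    return -sum(
        (count / n) * log2(count / cluster_sizes[k])
        for (k, _), count in joint.items()
    )


def mutual_information(classes, clusters):
    """I(class; cluster) = H(class) - H(class | cluster), in bits."""
    return entropy(classes) - conditional_entropy(classes, clusters)


# Hypothetical toy data: six patterns, two ground-truth classes.
classes = ["a", "a", "a", "b", "b", "b"]
good = [0, 0, 0, 1, 1, 1]        # clusters reproduce the classes exactly
poor = [0, 1, 0, 1, 0, 1]        # clusters are nearly uninformative
singletons = list(range(6))      # every pattern in its own cluster

for name, clusters in [("good", good), ("poor", poor), ("singletons", singletons)]:
    print(name, round(mutual_information(classes, clusters), 4))
```

In this toy case the singleton clustering scores as high as the perfect two-cluster one, since knowing a pattern's own cluster always determines its class. This is one way to see why raw mutual information is only comparable across clusterings with the same number of clusters, and why the abstract adds a cost for encoding the conditional probabilities when the numbers differ.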
