A Separability Index for Distance-based Clustering and Classification Algorithms

We propose a separability index that quantifies the degree of difficulty in a hard clustering problem under the assumption that each cluster follows a multivariate Gaussian distribution. A preliminary index is first defined, and several of its properties are explored both theoretically and numerically. This index is then adjusted so that its final form is also interpretable in terms of the Adjusted Rand Index between a true grouping and its hypothetical idealized clustering, which is taken as a surrogate for clustering complexity. The derived index is used to develop a data-simulation algorithm that generates samples with a prescribed value of the index. This algorithm is particularly useful for systematically generating datasets with varying degrees of clustering difficulty, which can then be used to evaluate the performance of different clustering algorithms. The index is also shown to be useful for summarizing the distinctiveness of classes in grouped datasets.
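To make the link between cluster separation and the Adjusted Rand Index concrete, the following is a minimal sketch and not the index or the simulation algorithm proposed in the paper. It assumes two spherical, unit-variance Gaussian clusters whose centers lie a chosen distance apart, and it uses assignment to the nearest true center as a stand-in for an "idealized clustering". All function names and parameters (`simulate_two_gaussians`, `nearest_center_labels`, `idealized_ari`, `distance`) are hypothetical and introduced only for illustration.

```python
import numpy as np
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index (Hubert and Arabie, 1985) between two labelings."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    _, a_idx = np.unique(a, return_inverse=True)
    _, b_idx = np.unique(b, return_inverse=True)
    # Contingency table of co-occurring labels.
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (a_idx, b_idx), 1)
    sum_cells = sum(comb(int(n), 2) for n in table.ravel())
    sum_rows = sum(comb(int(n), 2) for n in table.sum(axis=1))
    sum_cols = sum(comb(int(n), 2) for n in table.sum(axis=0))
    total_pairs = comb(len(a), 2)
    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case, e.g. a single cluster
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def simulate_two_gaussians(distance, n_per_cluster=500, dim=2, seed=0):
    """Assumed setup: two unit-variance spherical Gaussians, centers `distance` apart."""
    rng = np.random.default_rng(seed)
    centers = np.zeros((2, dim))
    centers[1, 0] = distance
    labels = np.repeat([0, 1], n_per_cluster)
    points = centers[labels] + rng.standard_normal((2 * n_per_cluster, dim))
    return points, labels, centers


def nearest_center_labels(points, centers):
    """Assign each point to the closest true center (idealized hard clustering)."""
    sq_dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dist.argmin(axis=1)


def idealized_ari(distance, **kwargs):
    """ARI between the true grouping and the nearest-true-center assignment."""
    points, labels, centers = simulate_two_gaussians(distance, **kwargs)
    return adjusted_rand_index(labels, nearest_center_labels(points, centers))


if __name__ == "__main__":
    for d in (0.5, 2.0, 6.0):
        print(f"distance={d:.1f}  ARI={idealized_ari(d):.3f}")
```

In this toy setting the ARI rises toward 1 as the centers move apart, which is the sense in which an ARI-based quantity can act as a surrogate for clustering difficulty. A simulation algorithm of the kind described in the abstract would instead start from a target value and search for cluster parameters that attain it.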
