Comparing subspace clusterings

We present the first framework for comparing subspace clusterings. We propose several distance measures for subspace clusterings, including generalizations of well-known distance measures for ordinary clusterings. We describe a set of important properties that any measure for comparing subspace clusterings should have, and we systematically compare our proposed measures in terms of these properties. We validate the usefulness of our distance measures by comparing clusterings produced by the algorithms FastDOC, HARP, PROCLUS, ORCLUS, and SSPC. We show that our distance measures can also be used to compare partial clusterings, overlapping clusterings, and patterns in binary data matrices.
