Multiple Clustering Views via Constrained Projections
Carlotta Domeniconi, Department of Computer Science, George Mason University, carlotta@cs.gmu.edu

It is well known that off-the-shelf clustering methods may discover different patterns in a given data set. This is because each clustering algorithm has its own bias, resulting from the optimization of a different criterion. Furthermore, there is no ground truth against which a clustering result can be validated, so no cross-validation technique can be carried out to tune the input parameters involved in the process. As a consequence, the user has no guidelines for choosing the proper clustering method for a given data set.

The use of clustering ensembles has emerged as a technique for overcoming these problems. A clustering ensemble consists of different clusterings, obtained from multiple runs of a single algorithm with different initializations, from various bootstrap samples of the available data, or from the application of different algorithms to the same data set. Clustering ensembles offer a solution to challenges inherent in the ill-posed nature of clustering: by exploiting the consensus across multiple clustering results, they can provide more robust and stable solutions, averaging out spurious structures that arise from the biases of the individual algorithms or from the variance induced by different data samples.

Another issue related to clustering is the so-called curse of dimensionality. Data with thousands of dimensions abound in fields and applications as diverse as bioinformatics, security and intrusion detection, and information and image retrieval. Clustering algorithms can handle data of low dimensionality, but tend to break down as the dimensionality increases, because in high-dimensional spaces the data become extremely sparse and points lie far apart from each other. A common scenario with high-dimensional data is that several clusters may exist in different subspaces, comprised of different combinations of features.
In many real-world problems, points in a given region of the input space may cluster along a given set of dimensions, while points located in another region may form a tight group with respect to different dimensions. Each dimension could be relevant to at least one of the clusters. Common global dimensionality reduction techniques are unable to capture such local structure of the data. Thus, a proper feature selection procedure should operate locally in the input space. Local feature selection allows one to estimate the degree to which features participate in the discovery of each cluster. As a result, many different subspace clustering methods have been proposed.

Traditionally, clustering ensembles and subspace clustering have been developed independently of one another. Clustering ensembles address the ill-posed nature of clustering, but in general do not address the curse of dimensionality. Subspace clustering avoids the curse of dimensionality in high-dimensional spaces, but typically requires the setting of critical input parameters whose values are unknown. To overcome these limitations, we have introduced a unified framework capable of handling both issues: the ill-posed nature of clustering and the curse of dimensionality. Addressing these two issues is nontrivial, as it involves solving a new problem altogether: the subspace clustering ensemble problem. Our approach takes two different perspectives: in one case we model the problem as a multi- and single-objective optimization problem [3, 2, 1]; in the other we take a generative view, and assume that the base clusterings are generated from a hidden consensus clustering of the data [5, 4]. Both directions are promising and lead to interesting challenges. The first can yield general and efficient solutions, but requires as input the number of clusters in the consensus clustering.
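The local feature relevance idea above can be sketched as follows, in the spirit of locally adaptive weighting schemes. The exponential weighting, the bandwidth parameter h, and the function name are illustrative assumptions, not a specific published formulation: each cluster receives one weight per dimension, larger along dimensions where the cluster is tighter.

```python
import math

def local_feature_weights(points, labels, h=1.0):
    """Per-cluster feature weights (illustrative sketch).

    points: list of equal-length feature vectors.
    labels: cluster id for each point.
    Returns {cluster_id: [w_1, ..., w_d]} where each weight vector
    sums to 1 and favors dimensions with small within-cluster spread.
    """
    d = len(points[0])
    weights = {}
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        w = []
        for dim in range(d):
            vals = [p[dim] for p in members]
            mean = sum(vals) / len(vals)
            var = sum((v - mean) ** 2 for v in vals) / len(vals)
            # Small variance along a dimension -> large weight.
            w.append(math.exp(-var / h))
        total = sum(w)
        weights[c] = [wi / total for wi in w]
    return weights

# Cluster 0 is tight along dimension 0 and spread along dimension 1,
# and vice versa for cluster 1, so each cluster ends up weighting a
# different dimension: the two clusters live in different subspaces.
pts = [(0.0, -5.0), (0.1, 5.0), (0.05, 0.0),
       (-5.0, 1.0), (5.0, 1.1), (0.0, 1.05)]
lbl = [0, 0, 0, 1, 1, 1]
w = local_feature_weights(pts, lbl)
```

Here `w[0]` concentrates almost all its mass on dimension 0 and `w[1]` on dimension 1, which is exactly the "each dimension is relevant to at least one cluster" scenario that a global projection cannot capture.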
The second has higher complexity, but provides a principled answer to the “How many clusters?” question. In this talk, I focus on the first approach. I introduce the formal definition of the subspace clustering ensemble problem, and heuristics to solve it. The objective is to define methods that exploit the information provided by an ensemble of subspace clustering solutions to compute a robust consensus subspace clustering. The problem is formulated as a multi- and single-objective optimization problem whose objective functions embed both sides of the ensemble components: the data clusterings and the assignments of features to clusters.
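To convey the flavor of such an objective, the hypothetical sketch below (not the formulation from the cited work) scores a candidate consensus against the ensemble by combining a data term, measuring pairwise partition agreement, with a feature term, comparing per-cluster feature-weight vectors point by point; the parameter alpha, the agreement measures, and all names are illustrative assumptions.

```python
def pairwise_agreement(a, b):
    """Fraction of point pairs on which two labelings agree
    (co-clustered in both, or separated in both)."""
    n = len(a)
    same, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            if (a[i] == a[j]) == (b[i] == b[j]):
                same += 1
    return same / total

def cosine(u, v):
    """Cosine similarity between two feature-weight vectors."""
    num = sum(x * y for x, y in zip(u, v))
    du = sum(x * x for x in u) ** 0.5
    dv = sum(y * y for y in v) ** 0.5
    return num / (du * dv)

def ensemble_score(cand_labels, cand_weights, ensemble, alpha=0.5):
    """Score a candidate (labels + per-cluster feature weights) against
    an ensemble of (labels, weights) pairs. Illustrative only.

    The data term averages pairwise partition agreement over the
    ensemble; the feature term compares, point by point, the weight
    vector of the point's candidate cluster with that of its cluster
    in each base solution. alpha trades off the two terms.
    """
    n = len(cand_labels)
    data_term, feat_term = 0.0, 0.0
    for labels, weights in ensemble:
        data_term += pairwise_agreement(cand_labels, labels)
        feat_term += sum(
            cosine(cand_weights[cand_labels[i]], weights[labels[i]])
            for i in range(n)
        ) / n
    m = len(ensemble)
    return alpha * data_term / m + (1 - alpha) * feat_term / m

# A candidate identical to the single base solution scores 1.0.
base = ([0, 0, 1, 1], {0: [1.0, 0.0], 1: [0.0, 1.0]})
print(ensemble_score([0, 0, 1, 1], {0: [1.0, 0.0], 1: [0.0, 1.0]}, [base]))  # → 1.0
```

Treating the two terms as separate objectives, instead of collapsing them with a fixed alpha, yields the multi-objective variant, where a Pareto front of consensus solutions is sought rather than a single optimum.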

References

[1] Joshua Zhexue Huang et al. Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values. Data Mining and Knowledge Discovery, 1998.

[2] Joshua Zhexue Huang et al. A New Initialization Method for Clustering Categorical Data. PAKDD, 2007.

[3] Ana L. N. Fred et al. Data clustering using evidence accumulation. ICPR, 2002.

[4] Petko Bakalov et al. FlexTrack: A System for Querying Flexible Patterns in Trajectory Databases. SSTD, 2011.

[5] Hans-Peter Kriegel et al. Evaluation of Multiple Clustering Solutions. MultiClust@ECML/PKDD, 2011.

[6] Anil K. Jain et al. A Mixture Model for Clustering Ensembles. SDM, 2004.

[7] Jiri Matas et al. Spatial and Feature Space Clustering: Applications in Image Analysis. CAIP, 1995.

[8] Zhengxin Chen et al. An iterative initial-points refinement algorithm for categorical data clustering. Pattern Recognition Letters, 2002.

[9] Andrea Tagarelli et al. Advancing data clustering via projective clustering ensembles. SIGMOD, 2011.

[10] Rakesh Agrawal et al. Fast Algorithms for Mining Association Rules. VLDB, 1994.

[11] Emmanuel Müller et al. Discovering Multiple Clustering Solutions: Grouping Objects in Different Views of the Data. ICDE, 2012.

[12] Anthony K. H. Tung et al. CARPENTER: Finding closed patterns in long biological datasets. KDD, 2003.

[13] Mikhail Belkin et al. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, 2003.

[14] C. A. Murthy et al. Density-Based Multiscale Data Condensation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.

[15] Arthur Zimek et al. When pattern met subspace cluster: a relationship story. 2011.

[16] Seungjin Choi et al. Independent Component Analysis. Handbook of Natural Computing, 2009.

[17] H. Ralambondrainy et al. A conceptual version of the K-means algorithm. Pattern Recognition Letters, 1995.

[18] Rong Jin et al. Distance Metric Learning: A Comprehensive Survey. 2006.

[19] Chris Clifton et al. Privacy-preserving distributed mining of association rules on horizontally partitioned data. IEEE Transactions on Knowledge and Data Engineering, 2004.

[20] Xiaoli Z. Fern et al. Cluster Ensemble Selection. Statistical Analysis and Data Mining, 2008.

[21] Carlotta Domeniconi et al. Subspace Metric Ensembles for Semi-supervised Clustering of High Dimensional Data. ECML, 2006.

[22] Ying Cui et al. Non-redundant Multi-view Clustering via Orthogonalization. ICDM, 2007.

[23] Inderjit S. Dhillon et al. Simultaneous Unsupervised Learning of Disparate Clusterings. Statistical Analysis and Data Mining, 2008.

[24] Philip S. Yu et al. Fast algorithms for projected clustering. SIGMOD, 1999.

[25] Xuan Vinh Nguyen et al. minCEntropy: A Novel Information Theoretic Approach for the Generation of Alternative Clusterings. ICDM, 2010.

[26] A. Zimek et al. Subspace Clustering, Ensemble Clustering, Alternative Clustering, Multiview Clustering: What Can We Learn From Each Other? 2010.

[27] Ian H. Witten et al. The WEKA data mining software: an update. SIGKDD Explorations, 2009.

[28] Mirco Musolesi et al. Sensing meets mobile social networks: the design, implementation and evaluation of the CenceMe application. SenSys, 2008.

[29] Euripides G. M. Petrakis et al. Similarity Searching in Medical Image Databases. IEEE Transactions on Knowledge and Data Engineering, 1997.

[30] Paul S. Bradley et al. Refining Initial Points for K-Means Clustering. ICML, 1998.

[31] Jimeng Sun et al. DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining. ICDM, 2008.

[32] Gary Chinga et al. Paper Surface Characterisation by Laser Profilometry and Image Analysis. 2003.

[33] Liu Yang. An Overview of Distance Metric Learning. 2007.

[34] Thomas M. Cover et al. Elements of Information Theory. 2005.

[35] Mikhail Belkin et al. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. NIPS, 2001.

[36] Michael C. Hout et al. Multidimensional Scaling. Encyclopedic Dictionary of Archaeology, 2003.

[37] Hans-Peter Kriegel et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD, 1996.

[38] Xiaofei He et al. Locality Preserving Projections. NIPS, 2003.

[39] Joshua Zhexue Huang et al. A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. DMKD, 1997.

[40] Andrea Tagarelli et al. Enhancing Single-Objective Projective Clustering Ensembles. ICDM, 2010.

[41] Joydeep Ghosh et al. Cluster Ensembles: A Knowledge Reuse Framework for Combining Multiple Partitions. Journal of Machine Learning Research, 2002.

[42] Yücel Saygin et al. Towards trajectory anonymization: a generalization-based approach. SPRINGL, 2008.

[43] Deniz Erdogmus et al. Information Theoretic Learning. Encyclopedia of Artificial Intelligence, 2005.

[44] Anil K. Jain et al. Multiobjective data clustering. CVPR, 2004.

[45] Jagat Narain Kapur. Measures of information and their applications. 1994.

[46] Zengyou He et al. A cluster ensemble method for clustering categorical data. Information Fusion, 2005.

[47] Xun Yi et al. Semi-Trusted Mixer Based Privacy Preserving Distributed Data Mining for Resource Constrained Devices. arXiv, 2010.

[48] Edwin Diday et al. Symbolic clustering using a new dissimilarity measure. Pattern Recognition, 1991.

[49] Ali S. Hadi et al. Finding Groups in Data: An Introduction to Cluster Analysis. 1991.

[50] Tomer Hertz et al. Learning Distance Functions using Equivalence Relations. ICML, 2003.

[51] Shehroz S. Khan et al. Cluster center initialization algorithm for K-means clustering. Pattern Recognition Letters, 2004.

[52] Aristides Gionis et al. Clustering aggregation. ICDE, 2005.

[53] James Bailey et al. Generation of Alternative Clusterings Using the CAMI Approach. SDM, 2010.

[54] Dimitrios Gunopulos et al. Data Clustering on a Network of Mobile Smartphones. IEEE/IPSJ International Symposium on Applications and the Internet, 2011.

[55] Sudipto Guha et al. ROCK: a robust clustering algorithm for categorical attributes. ICDE, 1999.

[56] Kilian Q. Weinberger et al. Distance Metric Learning for Large Margin Nearest Neighbor Classification. NIPS, 2005.

[57] Dimitrios Gunopulos et al. Scheduling for real-time mobile MapReduce systems. DEBS, 2011.

[58] Donald E. Knuth. Dancing links. arXiv cs/0011047, 2000.

[59] James Bailey et al. A hierarchical information theoretic technique for the discovery of non-linear alternative clusterings. KDD, 2010.

[60] S. T. Roweis et al. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000.

[61] Shehroz S. Khan et al. Computation of Initial Modes for K-modes Clustering Algorithm Using Evidence Accumulation. IJCAI, 2007.

[62] Ian Davidson et al. A principled and flexible framework for finding alternative clusterings. KDD, 2009.

[63] Sanjay Ghemawat et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.

[64] Wolfgang Lehner et al. Evolving Ensemble-Clustering to a Feedback-Driven Process. ICDM Workshops, 2010.

[65] Robert M. Haralick et al. Textural Features for Image Classification. IEEE Transactions on Systems, Man, and Cybernetics, 1973.

[66] Michael I. Jordan et al. Distance Metric Learning with Application to Clustering with Side-Information. NIPS, 2002.

[67] E. Parzen. On Estimation of a Probability Density Function and Mode. 1962.

[68] R. Fisher. The Use of Multiple Measurements in Taxonomic Problems. 1936.

[69] James Bailey et al. Generating multiple alternative clusterings via globally optimal subspaces. Data Mining and Knowledge Discovery, 2014.

[70] Gal Chechik et al. Extracting Relevant Structures with Side Information. NIPS, 2002.

[71] Avrim Blum et al. The Bottleneck. Monopsony Capitalism, 2021.

[72] Dimitrios Gunopulos et al. Disclosure-Free GPS Trace Search in Smartphone Networks. IEEE International Conference on Mobile Data Management, 2011.

[73] Ian Davidson et al. Finding Alternative Clusterings Using Constraints. ICDM, 2008.

[74] Shehroz S. Khan et al. Computing Initial Points using Density Based Multiscale Data Condensation for Clustering Categorical Data. 2003.

[75] Zengyou He. Farthest-Point Heuristic based Initialization Methods for K-Modes Clustering. arXiv, 2006.

[76] Michael I. Jordan et al. Multiple Non-Redundant Spectral Clustering Views. ICML, 2010.

[77] James Bailey et al. COALA: A Novel Approach for the Extraction of an Alternate Clustering of High Quality and High Dissimilarity. ICDM, 2006.

[78] Ira Assent et al. Less is More: Non-Redundant Subspace Clustering. 2010.

[79] Vipin Kumar et al. Partitioning-based clustering for Web document categorization. Decision Support Systems, 1999.

[80] Alexandre V. Evfimievski et al. Privacy preserving mining of association rules. Information Systems, 2002.

[81] Ira Assent et al. A Framework for Evaluation and Exploration of Clustering Algorithms in Subspaces of High Dimensional Databases. BTW, 2011.

[82] Wolfgang Lehner et al. Browsing Robust Clustering-Alternatives. MultiClust@ECML/PKDD, 2011.

[83] Michael R. Anderberg. Cluster Analysis for Applications. 1973.

[84] J. Tenenbaum et al. A global geometric framework for nonlinear dimensionality reduction. Science, 2000.

[85] Kathryn B. Laskey et al. Feature Enriched Nonparametric Bayesian Co-clustering. PAKDD, 2012.

[86] Jiye Liang et al. A new initialization method for categorical data clustering. Expert Systems with Applications, 2009.