SMVC: semi-supervised multi-view clustering in subspace projections

Because data is often multi-faceted by nature, a single clustering may not summarize it adequately. To better capture the data's complexity, methods that detect multiple, alternative clusterings have been proposed. Independent of this research area, semi-supervised clustering techniques have been shown to substantially improve single-view clustering results by integrating prior knowledge. In this paper, we join both research areas and present a solution for integrating prior knowledge into the process of detecting multiple clusterings. We propose a Bayesian framework that models multiple clusterings of the data by multiple mixture distributions, each responsible for an individual set of relevant dimensions. In addition, our model can handle prior knowledge in the form of instance-level constraints indicating which objects should or should not be grouped together. Since the assignment of constraints to specific views is not necessarily known a priori, our technique determines this assignment automatically. For efficient learning, we propose the algorithm SMVC, which is based on variational Bayesian methods. In experiments on various real-world data sets, we demonstrate SMVC's potential to detect multiple clustering views and its capability to improve the result by exploiting prior knowledge.
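
To make the notion of instance-level constraints and their unknown view membership concrete, the following is a minimal sketch in Python. It is not the SMVC algorithm and involves no Bayesian inference: it only encodes must-link and cannot-link constraints and checks which of several given clusterings (views) each constraint is consistent with. The names `Constraint`, `satisfied`, and `views_satisfying`, as well as the toy labels, are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical encoding: (i, j, must_link). Objects i and j should share a
# cluster if must_link is True, and should be separated if it is False.
Constraint = Tuple[int, int, bool]


def satisfied(labels: Sequence[int], c: Constraint) -> bool:
    """Return True if the clustering given by `labels` respects constraint c."""
    i, j, must_link = c
    same_cluster = labels[i] == labels[j]
    return same_cluster == must_link


def views_satisfying(views: List[Sequence[int]], c: Constraint) -> List[int]:
    """Return the indices of all views whose clustering respects constraint c."""
    return [v for v, labels in enumerate(views) if satisfied(labels, c)]


# Two toy clusterings (views) of the same five objects.
views = [
    [0, 0, 1, 1, 1],  # view 0
    [0, 1, 0, 1, 0],  # view 1
]

constraints: List[Constraint] = [
    (0, 1, True),   # must-link between objects 0 and 1
    (0, 2, True),   # must-link between objects 0 and 2
    (3, 4, False),  # cannot-link between objects 3 and 4
]

# For each constraint, list the views it is compatible with.
assignment = {c: views_satisfying(views, c) for c in constraints}

for c, compatible in assignment.items():
    print(c, "->", compatible)
```

In this toy setting each constraint is compatible with exactly one view, which illustrates why a constraint may need to be attributed to a specific view rather than enforced on all clusterings at once. How the paper's model actually determines this membership is part of its Bayesian framework and is not reproduced here.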
