CGC: A Flexible and Robust Approach to Integrating Co-Regularized Multi-Domain Graph for Clustering

Multi-view graph clustering aims to enhance clustering performance by integrating heterogeneous information collected in different domains. Each domain provides a different view of the data instances. Leveraging cross-domain information has been demonstrated an effective way to achieve better clustering results. Despite the previous success, existing multi-view graph clustering methods usually assume that different views are available for the same set of instances. Thus instances in different domains can be treated as having strict one-to-one relationship. In many real-life applications, however, data instances in one domain may correspond to multiple instances in another domain. Moreover, relationships between instances in different domains may be associated with weights based on prior (partial) knowledge. In this paper, we propose a flexible and robust framework, CGC (Co-regularized Graph Clustering), based on non-negative matrix factorization (NMF), to tackle these challenges. CGC has several advantages over the existing methods. First, it supports many-to-many cross-domain instance relationship. Second, it incorporates weight on cross-domain relationship. Third, it allows partial cross-domain mapping so that graphs in different domains may have different sizes. Finally, it provides users with the extent to which the cross-domain instance relationship violates the in-domain clustering structure, and thus enables users to re-evaluate the consistency of the relationship. We develop an efficient optimization method that guarantees to find the global optimal solution with a given confidence requirement. The proposed method can automatically identify noisy domains and assign smaller weights to them. This helps to obtain optimal graph partition for the focused domain. Extensive experimental results on UCI benchmark data sets, newsgroup data sets and biological interaction networks demonstrate the effectiveness of our approach.

[1]  Xiang Zhang,et al.  TEAM: efficient two-locus epistasis tests in human genome-wide association study , 2010, Bioinform..

[2]  Pedro Larrañaga,et al.  Estimation of Distribution Algorithms , 2002, Genetic Algorithms and Evolutionary Computation.

[3]  Yizhou Sun,et al.  Ranking-based clustering of heterogeneous information networks with star network schema , 2009, KDD.

[4]  Qiang Yang,et al.  Self-taught clustering , 2008, ICML '08.

[5]  Riccardo Poli,et al.  Particle swarm optimization , 1995, Swarm Intelligence.

[6]  Wei Tang,et al.  Clustering with Multiple Graphs , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[7]  M. Daly,et al.  PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes , 2003, Nature Genetics.

[8]  Ran El-Yaniv,et al.  Multi-way distributional clustering via pairwise interactions , 2005, ICML.

[9]  Hal Daumé,et al.  A Co-training Approach for Multi-view Spectral Clustering , 2011, ICML.

[10]  Wei Cheng,et al.  Inferring novel associations between SNP sets and gene sets in eQTL study using sparse graphical model , 2012, BCB.

[11]  Jieping Ye,et al.  Multi-objective Multi-view Spectral Clustering via Pareto Optimization , 2013, SDM.

[12]  Sham M. Kakade,et al.  Multi-view clustering via canonical correlation analysis , 2009, ICML '09.

[13]  Jun Dong,et al.  Geometric Interpretation of Gene Coexpression Network Analysis , 2008, PLoS Comput. Biol..

[14]  Hal Daumé,et al.  Co-regularized Multi-view Spectral Clustering , 2011, NIPS.

[15]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[16]  H. Sebastian Seung,et al.  Algorithms for Non-negative Matrix Factorization , 2000, NIPS.

[17]  Yizhou Sun,et al.  RankClus: integrating clustering with ranking for heterogeneous information network analysis , 2009, EDBT '09.

[18]  Carla E. Brodley,et al.  Solving cluster ensemble problems by bipartite graph partitioning , 2004, ICML.

[19]  Fionn Murtagh,et al.  Cluster Dissection and Analysis: Theory, Fortran Programs, Examples. , 1986 .

[20]  J. Booth,et al.  Resampling-Based Multiple Testing. , 1994 .

[21]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[22]  H. Spath Cluster Dissection and Analysis , 1985 .

[23]  S. Dongen A cluster algorithm for graphs , 2000 .

[24]  D. Hand Cluster dissection and analysis: Helmuth SPATH Wiley, Chichester, 1985, 226 pages, £25.00 , 1986 .

[25]  Yizhou Sun,et al.  Mining Heterogeneous Information Networks: Principles and Methodologies , 2012, Mining Heterogeneous Information Networks: Principles and Methodologies.

[26]  Jiawei Han,et al.  ACM Transactions on Knowledge Discovery from Data: Introduction , 2007 .

[27]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[28]  Arindam Banerjee,et al.  Multi-way Clustering on Relation Graphs , 2007, SDM.

[29]  J. A. Lozano,et al.  Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation , 2001 .

[30]  Xin Liu,et al.  Document clustering based on non-negative matrix factorization , 2003, SIGIR.

[31]  Srinivasan Parthasarathy,et al.  An ensemble framework for clustering protein-protein interaction networks , 2007, ISMB/ECCB.

[32]  Xiaofeng Zhu,et al.  Genome-wide searching of rare genetic variants in WTCCC data , 2010, Human Genetics.

[33]  Jiawei Han,et al.  Learning a Kernel for Multi-Task Clustering , 2011, AAAI.

[34]  Andreas Krause,et al.  Cost-effective outbreak detection in networks , 2007, KDD '07.

[35]  Chris H. Q. Ding,et al.  Orthogonal nonnegative matrix t-factorizations for clustering , 2006, KDD '06.

[36]  Steffen Bickel,et al.  Multi-view clustering , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[37]  Huan Liu,et al.  Community detection via heterogeneous interaction analysis , 2012, Data Mining and Knowledge Discovery.

[38]  Ian Davidson,et al.  Flexible constrained spectral clustering , 2010, KDD.

[39]  Zili Zhang,et al.  Protein Function Prediction by Integrating Multiple Kernels , 2013, IJCAI.

[40]  Philip S. Yu,et al.  A General Model for Multiple View Unsupervised Learning , 2008, SDM.

[41]  Yizhou Sun,et al.  Graph-based Consensus Maximization among Multiple Supervised and Unsupervised Models , 2009, NIPS.

[42]  Ling Liu,et al.  Social influence based clustering of heterogeneous information networks , 2013, KDD.

[43]  Chris H. Q. Ding,et al.  Symmetric Nonnegative Matrix Factorization for Graph Clustering , 2012, SDM.

[44]  H. Cordell Detecting gene–gene interactions that underlie human diseases , 2009, Nature Reviews Genetics.

[45]  Christopher J. C. Burges,et al.  Spectral clustering and transductive learning with multiple views , 2007, ICML '07.

[46]  Fred W. Glover,et al.  The general employee scheduling problem. An integration of MS and AI , 1986, Comput. Oper. Res..

[47]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[48]  Bert L. de Groot,et al.  Detection of Functional Modes in Protein Dynamics , 2009, PLoS Comput. Biol..