Network clustering: probing biological heterogeneity by sparse graphical models

MOTIVATION: Networks and pathways are important in describing the collective biological function of molecular players such as genes or proteins. In many areas of biology, for example in cancer studies, available data may harbour undiscovered subtypes which differ in terms of network phenotype. That is, samples may be heterogeneous with respect to underlying molecular networks. This motivates a need for unsupervised methods capable of discovering such subtypes and elucidating the corresponding network structures.

RESULTS: We exploit recent results in sparse graphical model learning to put forward a 'network clustering' approach in which data are partitioned into subsets that show evidence of underlying, subset-level network structure. This allows us to simultaneously learn subset-specific networks and corresponding subset membership under challenging small-sample conditions. We illustrate this approach on synthetic and proteomic data.

AVAILABILITY: go.warwick.ac.uk/sachmukherjee/networkclustering
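
To make the idea of learning subset membership and subset-specific networks together more concrete, one possible formalisation is sketched below. It is an illustration under stated assumptions, not necessarily the formulation used in the paper: it assumes Gaussian data, hard assignments, and an l1 (graphical-lasso-style) penalty on each subset's precision matrix. The symbols K, z_i, \mu_k, \Theta_k, S_k, n_k and \lambda are introduced here for illustration only.

```latex
% Assumed setup: samples x_1, ..., x_n in R^p, K subsets,
% hard labels z_i in {1, ..., K}, subset means mu_k,
% and sparse precision (inverse covariance) matrices Theta_k.
\begin{align*}
n_k &= \bigl|\{\, i : z_i = k \,\}\bigr|, \qquad
S_k = \frac{1}{n_k} \sum_{i \,:\, z_i = k} (x_i - \mu_k)(x_i - \mu_k)^{\top}, \\[4pt]
% Penalised Gaussian log-likelihood, up to an additive constant
\max_{z,\, \mu,\, \Theta}\;
  & \sum_{k=1}^{K} \Bigl[ \frac{n_k}{2} \bigl( \log\det \Theta_k
      - \operatorname{tr}(S_k \Theta_k) \bigr)
      - \lambda \,\lVert \Theta_k \rVert_1 \Bigr], \\[4pt]
% Reassignment step with the networks held fixed
z_i &\leftarrow \arg\max_{k} \Bigl[ \tfrac{1}{2} \log\det \Theta_k
      - \tfrac{1}{2} (x_i - \mu_k)^{\top} \Theta_k (x_i - \mu_k) \Bigr].
\end{align*}
```

Under these assumptions, the nonzero off-diagonal entries of \Theta_k define the edges of the network for subset k. Alternating between re-estimating each \Theta_k with the labels fixed and reassigning the labels with the networks fixed gives a simple iterative scheme. A larger \lambda gives sparser networks, which is what makes estimation feasible when each subset contains few samples.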
