A boosting approach to structure learning of graphs with and without prior knowledge

MOTIVATION: Identifying the network structure through which genes and their products interact can help to elucidate normal cell physiology as well as the genetic architecture of pathological phenotypes. Recently, a number of gene network inference tools based on Gaussian graphical model representations have appeared. Following this line of work, we introduce a novel Boosting approach to learning the structure of a high-dimensional Gaussian graphical model, motivated by applications in genomics. Particular emphasis is placed on the inclusion of partial prior knowledge about the structure of the graph. With the increasing availability of pathway information and large-scale gene expression datasets, we believe that conditioning on prior knowledge will be an important means of raising the statistical power of structural learning algorithms to infer true conditional dependencies.

RESULTS: Our Boosting approach, termed BoostiGraph, is conceptually and algorithmically simple. It complements recent work on the network inference problem based on Lasso-type approaches. BoostiGraph is computationally cheap and is applicable to very high-dimensional graphs. For example, on graphs with on the order of 5000 nodes, it can map out paths for the conditional independence structure in a few minutes. Using computer simulations, we investigate the ability of our method, with and without prior information, to infer Gaussian graphical models from artificial as well as actual microarray datasets. The experimental results demonstrate that our method can recover the true network topology with relatively high accuracy.

AVAILABILITY: This method and all other associated files are freely available from http://www.stats.ox.ac.uk/~anjum/.
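
To make the idea of boosting-based structure learning concrete, the following is a minimal sketch of one common way such a procedure can be set up: each variable is regressed on all others by componentwise L2-boosting, and an edge is kept when both node-wise regressions assign it a non-negligible coefficient. This is an illustration under stated assumptions, not the BoostiGraph implementation. The function names, the fixed number of boosting steps, the step length `nu`, the coefficient threshold and the "AND" rule for combining neighbourhoods are all hypothetical choices made here, and the use of prior knowledge is not covered.

```python
import numpy as np


def boost_neighbourhood(X, j, n_steps=100, nu=0.1):
    """Componentwise L2-boosting of column j on all other columns.

    Hypothetical helper: a fixed number of steps is assumed instead of
    a data-driven stopping rule.
    """
    n, p = X.shape
    # Standardise so every column has mean 0 and unit variance (x'x = n).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    others = np.array([k for k in range(p) if k != j])
    resid = Z[:, j].copy()
    coef = np.zeros(p)
    for _ in range(n_steps):
        # Least-squares coefficient of the residual on each single predictor.
        b = Z[:, others].T @ resid / n
        # The predictor with the largest |b| reduces the residual sum of squares most.
        best = np.argmax(np.abs(b))
        k = others[best]
        # Take only a small step (shrinkage nu) towards that fit.
        coef[k] += nu * b[best]
        resid -= nu * b[best] * Z[:, k]
    return coef


def boosted_graph(X, threshold=0.1, **kwargs):
    """Combine node-wise fits into an undirected edge set (assumed AND rule)."""
    p = X.shape[1]
    B = np.vstack([boost_neighbourhood(X, j, **kwargs) for j in range(p)])
    strong = np.abs(B) > threshold
    return {(i, j) for i in range(p) for j in range(i + 1, p)
            if strong[i, j] and strong[j, i]}


# Toy data: a chain 0 - 1 - 2 plus an isolated node 3.
rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal((n, 4))
x0 = z[:, 0]
x1 = 0.8 * x0 + 0.6 * z[:, 1]
x2 = 0.8 * x1 + 0.6 * z[:, 2]
x3 = z[:, 3]
X = np.column_stack([x0, x1, x2, x3])

edges = boosted_graph(X)
print(sorted(edges))
```

In this toy example nodes 0 and 2 are correlated but conditionally independent given node 1, so a method that targets conditional dependencies should not connect them directly. In practice the number of boosting steps would be chosen by a model selection criterion rather than fixed in advance.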
