Network models with applications to genomic data: generalization, validation and uncertainty assessment

The aim of this thesis is to provide a framework for the estimation and analysis of transcription networks in human cancer. The methods we develop are applied to data collected by The Cancer Genome Atlas (TCGA), and the supporting simulations are based on models derived from these data so that they reflect real data structure. Nevertheless, the proposed models apply to network construction for any data type. The thesis comprises four papers, each addressing a different aspect of network estimation.

Statistical analysis of high-dimensional data requires regularization. Network model validation amounts to selecting the regularization parameters that control sparsity and, possibly, some common structure across different data classes (here, types of cancer). In paper I we present a bootstrap-based method for sparsity selection and robust network construction. We show, by simulation studies, that the proposed methods select sparsity so as to control the false positive rate, rather than to match the size of the true underlying network.

In paper II we address the problem of uncertainty in network estimation. Since network estimation is very unstable, assessing uncertainty is important in order to avoid overinterpretation of results. Using ideas from information theory, we introduce a method that assesses uncertainty by presenting a set of candidate network estimates, rather than a single network model. The method enables us to show that different network topologies have different estimation properties, and that the performance of each network estimation method depends on the topology.

It is often of interest to identify and study the commonalities and differences in network estimates across several classes (here, types of cancer) and data types. Statistical network models, like the graphical lasso, provide a framework in which several classes and data types can be integrated. Paper III makes use of such a framework and presents a method that allows for large-scale sparse inverse covariance estimation across several classes. Through the application of priors, we account for plausible connections across different data types. The proposed method also encourages the expected modular structure of biological networks and corrects for unbalanced sample sizes across classes. The estimated networks are part of a publicly accessible resource termed Cancer Landscapes (\url{cancerlandscapes.org}), which provides a setting for interactive analysis in relation to pathway and pharmacological databases, diagnoses, survival associations and drug targets.

Traditionally, the analysis of genomic data has focused on the study of differential expression. In paper IV we propose a way to integrate differential expression analysis with network estimation. To that end, we extend existing methods in order to jointly estimate sparse mean vectors and precision matrices across several classes, thus gaining over analyses that focus on one or the other. Additionally, by assuming a block-diagonal structure in the precision matrices, the problem can be recast as an ensemble classifier in which each block becomes part of either a linear or a quadratic discriminant function.
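
The block-wise decomposition can be sketched as follows; the notation is an assumption made for illustration and is not taken from the thesis. Suppose class $k$ has prior probability $\pi_k$, mean vector $\mu_k$ and block-diagonal precision matrix $\Omega_k = \mathrm{diag}(\Omega_k^{(1)}, \dots, \Omega_k^{(B)})$, and write $x^{(b)}$ and $\mu_k^{(b)}$ for the sub-vectors belonging to block $b$. Up to an additive constant that is the same for every class, the Gaussian discriminant function then separates into a sum over blocks:

\[
\delta_k(x) = \log \pi_k + \sum_{b=1}^{B} \left[ \tfrac{1}{2} \log\det \Omega_k^{(b)} - \tfrac{1}{2} \bigl(x^{(b)} - \mu_k^{(b)}\bigr)^{\top} \Omega_k^{(b)} \bigl(x^{(b)} - \mu_k^{(b)}\bigr) \right].
\]

If a block's precision matrix is shared by all classes, $\Omega_k^{(b)} = \Omega^{(b)}$, the log-determinant and the term that is quadratic in $x^{(b)}$ cancel when classes are compared, and the block contributes only the linear term $x^{(b)\top} \Omega^{(b)} \mu_k^{(b)} - \tfrac{1}{2} \mu_k^{(b)\top} \Omega^{(b)} \mu_k^{(b)}$. If the precision matrices of the block differ between classes, its contribution remains quadratic in $x^{(b)}$.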
