An inferential framework for network hypothesis tests: With applications to biological networks

By Phillip D. Yates, Ph.D.

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Virginia Commonwealth University. Virginia Commonwealth University, 2010. Major Director: Nitai D. Mukhopadhyay, Assistant Professor, Department of Biostatistics.

The analysis of weighted co-expression gene sets is gaining momentum in systems biology. In addition to substantial research directed toward inferring co-expression networks from microarray/high-throughput sequencing data, inferential methods are being developed to compare gene networks across one or more phenotypes. Common gene set hypothesis testing procedures are mostly confined to comparing average gene/node transcription levels between one or more groups and make limited use of additional network features, e.g., edges induced by significant partial correlations. Ignoring the gene set architecture disregards relevant network topological comparisons and can result in familiar n ≪ p over-parameterized test issues.

In this dissertation we propose a method for performing one- and two-sample hypothesis tests for (weighted) networks. We build on a measure of separation defined via a local neighborhood metric. This node-centered additive metric exploits the network properties of nearby neighbors. The use of local neighborhoods seeks to lessen the effect of a large number of (potentially) estimable parameters; biology or algorithms are commonly used to further reduce the prospect of spurious biological associations. Where possible, we avoid specifying dubious network probability models. To draw statistical inferences, we use a resampling approach. Our method allows for both an overall network test and a post hoc examination of individual gene/node effects. We evaluate our approach using both simulated data and microarray data obtained from diabetes and ovarian cancer studies.
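
As a rough illustration of the kind of procedure the abstract describes (a node-centered additive statistic over local neighborhoods, with inference by resampling), the following is a minimal sketch and not the dissertation's actual method. It assumes Pearson correlation as the edge weight, a fixed and user-supplied neighbor list per node, the sum of absolute edge-weight differences as the per-node separation, and a permutation of group labels as the resampling scheme. All function names (`pearson`, `corr_matrix`, `node_scores`, `permutation_test`) are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def corr_matrix(samples):
    """Edge weights: pairwise correlations between columns (genes/nodes) of a list of rows."""
    p = len(samples[0])
    cols = [[row[j] for row in samples] for j in range(p)]
    return [[1.0 if i == j else pearson(cols[i], cols[j]) for j in range(p)]
            for i in range(p)]


def node_scores(group_a, group_b, neighbors):
    """Per-node separation: sum of |weight_A - weight_B| over each node's neighbors.

    This particular metric is an assumption made for illustration only.
    """
    ca, cb = corr_matrix(group_a), corr_matrix(group_b)
    return [sum(abs(ca[i][j] - cb[i][j]) for j in neighbors[i])
            for i in range(len(neighbors))]


def permutation_test(group_a, group_b, neighbors, n_perm=200, seed=0):
    """Overall two-sample network test: permute group labels, recompute the summed statistic."""
    rng = random.Random(seed)
    observed = sum(node_scores(group_a, group_b, neighbors))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of samples to the two groups
        stat = sum(node_scores(pooled[:n_a], pooled[n_a:], neighbors))
        if stat >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Here `group_a` and `group_b` are lists of samples (one row of expression values per subject), and `neighbors[i]` lists the indices of the nodes adjacent to node `i`. The vector returned by `node_scores` is what a post hoc look at individual gene/node effects could start from, while `permutation_test` gives the overall network-level p-value under these assumptions.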
