Computational analysis of molecular networks: modeling and reconstruction

Motivation: A central step in the analysis of gene expression data is the identification of groups of genes that exhibit similar expression patterns. Clustering gene expression data into homogeneous groups has been shown to be instrumental in functional annotation, tissue classification, regulatory motif identification, and other applications. Although there is a rich literature on clustering algorithms for gene expression analysis, very few works have addressed the systematic comparison and evaluation of clustering results. Typically, different clustering algorithms yield different clustering solutions on the same data, and there is no agreed-upon guideline for choosing among them.

Results: We developed a novel statistically based method for assessing a clustering solution according to prior biological knowledge. Our method can be used to compare different clustering solutions or to optimize the parameters of a clustering algorithm. The method is based on projecting vectors of biological attributes of the clustered elements onto the real line, such that the ratio of the between-group to the within-group variance estimators is maximized. The projected data are then scored using a non-parametric analysis of variance test, and the score’s confidence is evaluated. We validate our approach using simulated data and show that our scoring method outperforms several extant methods, including the separation to homogeneity ratio and the silhouette measure. We apply our method to evaluate results of several clustering methods on yeast cell-cycle gene expression data.

Availability: The software is available from the authors upon request.

Contact: iritg@post.tau.ac.il; rshamir@post.tau.ac.il; roded@icsi.berkeley.edu
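The scoring idea summarized under Results can be illustrated with a minimal sketch. This is not the authors' software. It assumes that the variance-ratio-maximizing projection is Fisher's linear discriminant and that the non-parametric analysis of variance is the Kruskal-Wallis H statistic; the function names, the small ridge term, and the omission of the confidence evaluation are choices made only for this illustration.

```python
import numpy as np


def fisher_projection(attributes, labels):
    """Project attribute vectors onto the line that maximizes the ratio of
    between-group to within-group scatter (Fisher's linear discriminant,
    assumed here as the projection step)."""
    X = np.asarray(attributes, dtype=float)
    labels = np.asarray(labels)
    d = X.shape[1]
    overall_mean = X.mean(axis=0)
    within = np.zeros((d, d))
    between = np.zeros((d, d))
    for g in np.unique(labels):
        Xg = X[labels == g]
        group_mean = Xg.mean(axis=0)
        centered = Xg - group_mean
        within += centered.T @ centered
        diff = (group_mean - overall_mean)[:, None]
        between += len(Xg) * (diff @ diff.T)
    # Hypothetical ridge term so that the within-group scatter is invertible.
    within += 1e-8 * np.eye(d)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(within, between))
    direction = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
    return X @ direction


def kruskal_wallis_h(values, labels):
    """Kruskal-Wallis H statistic. Ties are not rank-averaged, which assumes
    the projected values are effectively continuous."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    n = len(values)
    ranks = np.empty(n)
    ranks[np.argsort(values)] = np.arange(1, n + 1)
    total = 0.0
    for g in np.unique(labels):
        group_ranks = ranks[labels == g]
        total += group_ranks.sum() ** 2 / len(group_ranks)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


def score_clustering(attributes, labels):
    """Higher scores mean the clusters are better separated with respect to
    the attribute vectors (confidence evaluation is omitted in this sketch)."""
    return kruskal_wallis_h(fisher_projection(attributes, labels), labels)
```

Under these assumptions, two clustering solutions of the same elements could be compared by calling `score_clustering` on each label vector with the same attribute matrix and preferring the higher score.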
