Combining heterogeneous sources of data for the reverse-engineering of gene regulatory networks

Gene Regulatory Networks (GRNs) represent how genes interact in various cellular processes by describing how the expression level, or activity, of a gene can affect the expression of other genes. Reverse-engineering GRN models can help biologists understand and gain insight into genetic conditions and diseases. Recently, the increasingly widespread use of DNA microarrays, a high-throughput technology that allows the expression of thousands of genes to be measured simultaneously in biological experiments, has made many gene expression datasets publicly available and led to an explosion of research in the reverse-engineering of GRN models. However, microarray technology has a number of limitations as a data source for modelling GRNs, owing to concerns over its reliability and the reproducibility of experimental results. The underlying theme of the research presented in this thesis is the incorporation of multiple sources and different types of data into techniques for reverse-engineering, or learning, GRNs from data. By drawing on many data sources, the resulting network models should be more robust, accurate and reliable than models learnt from a single data source. This is pursued through two main strands of research. First, the thesis presents some of the earliest work on incorporating prior knowledge generated from a large body of scientific papers into Bayesian network based GRN models. Second, novel methods for using multiple microarray datasets to produce Bayesian network based GRN models are introduced. Empirical evaluations show that incorporating literature-based prior knowledge and combining multiple microarray datasets can improve the reverse-engineering of Bayesian network based GRN models, compared with using a single microarray dataset.
