Machine learning methods for the discovery of regulatory elements in bacteria

Technological advances are increasing both the volume and the kinds of biological data being generated. These data sets hold great promise for advances in biology and medicine. Because of their size, however, manual analysis is often impractical, and novel computational approaches are needed. This thesis investigates the use of machine learning methods for discovering an important class of DNA sequences, known as regulatory elements, that are encoded in the genomes of bacteria. One set of contributions of this thesis is in computational biology. We develop probabilistic models of three types of regulatory elements (promoters, terminators and operons). Key properties of our approach are that it combines heterogeneous evidence sources, predicts all three types of regulatory elements in a single model, and predicts regulatory elements in a set of bacterial genomes simultaneously. We present experiments showing that our promoter, terminator and operon predictions all exceed the previous state of the art in accuracy. A second set of contributions is in machine learning. Two of these contributions are novel methods for learning the parameters and the structure of a probabilistic grammar. Our empirical evaluation shows that both methods lead to improved accuracy on a terminator prediction task. Another machine learning contribution of this thesis is a semi-supervised approach to learning from “weakly-labeled” training examples. We show how to acquire and use weakly-labeled examples by exploiting relationships among concepts. Our empirical evaluation shows that these examples can increase accuracy for some training set sizes. A final machine learning contribution of this thesis is a probabilistic framework for representing and predicting overlapping elements in sequence data. Unlike hidden Markov models, which assign labels to individual positions of a sequence, our approach assigns labels to whole subsequences. Experiments show that this approach is more accurate than two alternatives. Although each of these machine learning contributions is motivated by properties of the regulatory element discovery problem, they are general and apply to other domains as well.
