Modeling the Evolution of Regulatory Elements by Simultaneous Detection and Alignment with Phylogenetic Pair HMMs

The computational detection of regulatory elements in DNA is a difficult but important problem that affects our progress in understanding the complex nature of eukaryotic gene regulation. Attempts to use cross-species conservation for this task have been hampered both by evolutionary changes in functional sites and by the poor performance of general-purpose alignment programs on non-coding sequence. We describe a new and flexible framework for modeling binding-site evolution in multiple related genomes, based on phylogenetic pair hidden Markov models, which explicitly model the gain and loss of binding sites along a phylogeny. We demonstrate the value of this framework for both the alignment of regulatory regions and the inference of precise binding-site locations within those regions. Because the underlying formalism is a stochastic, generative model, it can also be used to simulate the evolution of regulatory elements. Our implementation scales with the number of species and with sequence length, and it can produce alignments and binding-site predictions with accuracy rivaling or exceeding that of current systems that specialize in only alignment or only binding-site prediction. We demonstrate the validity and power of various model components in extensive simulations of realistic sequence data, and we apply a specific model to study Drosophila enhancers in as many as ten related genomes and in the presence of gain and loss of binding sites. Different models and modeling assumptions can be easily specified, providing a valuable tool for exploring biological hypotheses that can drive improvements in our understanding of the mechanisms and evolution of gene regulation.
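To make the idea of binding-site gain and loss along a branch concrete, one simple picture is a two-state continuous-time Markov chain in which a site is either absent or present. The sketch below is an illustrative assumption, not the parameterization used in this work: the function name `gain_loss_transition` and the rates `gain_rate` and `loss_rate` are hypothetical. It computes the standard closed-form transition probabilities for such a chain over a branch of length `t`.

```python
import math


def gain_loss_transition(t, gain_rate, loss_rate):
    """Transition matrix of a two-state gain/loss chain over branch length t.

    States: 0 = binding site absent, 1 = binding site present.
    Returns P where P[i][j] is the probability of ending in state j
    given a start in state i. Names and rates are illustrative only.
    """
    total = gain_rate + loss_rate
    if t < 0 or gain_rate < 0 or loss_rate < 0 or total <= 0:
        raise ValueError("need t >= 0, non-negative rates, and a positive total rate")

    # Stationary probabilities of the two states.
    pi_present = gain_rate / total
    pi_absent = loss_rate / total

    # Fraction of the initial state's influence remaining after time t.
    decay = math.exp(-total * t)

    return [
        [pi_absent + pi_present * decay, pi_present * (1.0 - decay)],
        [pi_absent * (1.0 - decay), pi_present + pi_absent * decay],
    ]


if __name__ == "__main__":
    # Hypothetical rates: gains are rarer than losses.
    P = gain_loss_transition(t=0.5, gain_rate=0.2, loss_rate=1.0)
    print("P(absent -> present):", P[0][1])
    print("P(present -> absent):", P[1][0])
```

On a zero-length branch the matrix is the identity, and on a very long branch both rows approach the stationary distribution, so the starting state is forgotten. A full phylogenetic model would combine such per-branch matrices across the tree.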
