Evolutionary Triplet Models of Structured RNA

The reconstruction and synthesis of ancestral RNAs is a feasible goal for paleogenetics. This will require new bioinformatics methods, including a robust statistical framework for reconstructing histories of substitutions, indels and structural changes. We describe a “transducer composition” algorithm for extending pairwise probabilistic models of RNA structural evolution to models of multiple sequences related by a phylogenetic tree. This algorithm draws on formal models of computational linguistics as well as the 1985 protosequence algorithm of David Sankoff. The output of the composition algorithm is a multiple-sequence stochastic context-free grammar. We describe dynamic programming algorithms for parsing this grammar that are robust to null cycles and empty bifurcations. Example applications include structural alignment of non-coding RNAs, propagation of structural information from an experimentally characterized sequence to its homologs, and inference of the ancestral structure of a set of diverged RNAs. We implemented these algorithms for a simple model of pairwise RNA structural evolution; in particular, we implemented maximum likelihood (ML) alignment of three known RNA structures given a known phylogeny, together with inference of their common ancestral structure. We compared this ML algorithm to a variety of related but simpler techniques, including ML alignment algorithms for simpler models that omit various aspects of the full model, and a posterior-decoding alignment algorithm for one of the simpler models. In our tests, incorporation of basepair structure was the most important factor for accurate alignment inference; appropriate use of posterior decoding was next; and fine details of the model were least important. Posterior-decoding heuristics can be substantially faster than exact phylogenetic inference, which motivates the use of sum-over-pairs heuristics, and approximations to them, where possible. For more exact probabilistic inference, we discuss the use of transducer composition for ML (or MCMC) inference on phylogenies, including possible ways to make the core operations tractable.
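
As a rough illustration of what dynamic programming over a stochastic context-free grammar involves, the following Python sketch computes the inside probability of a single RNA sequence under a toy three-rule grammar. It is not the multiple-sequence grammar produced by transducer composition: the grammar, the rule probabilities, the emission values and the names `inside`, `emit_single` and `emit_pair` are all assumptions made for this example. Because every bifurcation in this toy grammar emits a basepair, it has no null cycles or empty bifurcations, so it sidesteps the cases the algorithms described above are designed to handle.

```python
# Toy single-sequence SCFG (hypothetical, for illustration only):
#   S -> x S y S   x basepairs with y      probability P_PAIR
#   S -> x S       x is unpaired           probability P_SINGLE
#   S -> (empty)                           probability P_END
P_PAIR, P_SINGLE, P_END = 0.3, 0.5, 0.2

# Assumed emission model: canonical pairs share 0.8 of the pair mass,
# the other 12 ordered pairs share the remaining 0.2.
CANONICAL = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def emit_single(x):
    """Probability of emitting one unpaired nucleotide (uniform)."""
    return 0.25


def emit_pair(x, y):
    """Probability of emitting the basepair (x, y)."""
    return 0.2 if (x, y) in CANONICAL else 0.2 / 12


def inside(seq):
    """Return P(seq | grammar), summed over all parses.

    table[i][j] holds the probability that S generates seq[i:j].
    """
    n = len(seq)
    table = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][i] = P_END  # empty subsequence: only S -> (empty) applies
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # S -> x S: seq[i] unpaired, S generates seq[i+1:j]
            total = P_SINGLE * emit_single(seq[i]) * table[i + 1][j]
            # S -> x S y S: seq[i] pairs with seq[k]
            for k in range(i + 1, j):
                total += (P_PAIR * emit_pair(seq[i], seq[k])
                          * table[i + 1][k] * table[k + 1][j])
            table[i][j] = total
    return table[0][n]
```

Calling `inside("GGGAAACCC")` sums over every nested structure the toy grammar can assign to that sequence. Replacing the sums with maxima would give the corresponding CYK-style ML parse, the single-sequence analogue of the ML alignment mentioned above.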
