Probabilistic Phylogenetic Inference with Insertions and Deletions

A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models account only for residue substitution events. We describe a probabilistic model of a multiple sequence alignment, given a phylogenetic tree, that accounts for insertion and deletion events in addition to substitutions by using a rate matrix augmented with the gap character. Starting from a continuous-time Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in the PHYLIP package. Using standard benchmarking methods on simulated data and a new “concordance test” benchmark on real ribosomal RNA alignments, we show that the extended program, dnamlε, improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.
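
To make the idea of a gap-augmented rate matrix concrete, the following minimal sketch computes the likelihood of a single alignment column with the Felsenstein peeling recursion over five states (A, C, G, T and the gap). It is an illustration under simplifying assumptions, not the dnamlε implementation: the rate values, the uniform root distribution, the nested-tuple tree encoding, and all function names are hypothetical, and the gap is treated as an ordinary fifth state rather than through the birth–death construction described above.

```python
STATES = "ACGT-"  # four nucleotides plus the gap character as a fifth state


def make_rate_matrix(sub_rate=1.0, ins_rate=0.05, del_rate=0.1):
    """Build a hypothetical 5x5 rate matrix Q (rates are illustrative only)."""
    n = len(STATES)
    q = [[0.0] * n for _ in range(n)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = sub_rate / 3.0  # residue -> residue (substitution)
        q[i][4] = del_rate                # residue -> gap (deletion)
    for j in range(4):
        q[4][j] = ins_rate / 4.0          # gap -> residue (insertion)
    for i in range(n):
        q[i][i] = -sum(q[i][j] for j in range(n) if j != i)  # rows sum to 0
    return q


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=20, squarings=10):
    """Approximate exp(Q t) by scaling and squaring with a Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def conditional_likelihoods(node, q):
    """Peeling step: P(observed leaves below node | state at node).

    A leaf is a one-character string; an internal node is a tuple of
    (child, branch_length) pairs. This encoding is an assumption of the sketch.
    """
    if isinstance(node, str):
        return [1.0 if s == node else 0.0 for s in STATES]
    partial = [1.0] * len(STATES)
    for child, t in node:
        p = mat_exp(q, t)
        below = conditional_likelihoods(child, q)
        for i in range(len(STATES)):
            partial[i] *= sum(p[i][j] * below[j] for j in range(len(STATES)))
    return partial


def column_likelihood(tree, q, root_freqs):
    """Likelihood of one alignment column, summing over the root state."""
    at_root = conditional_likelihoods(tree, q)
    return sum(f * x for f, x in zip(root_freqs, at_root))


if __name__ == "__main__":
    q = make_rate_matrix()
    root_freqs = [0.2] * 5  # assumed root distribution, not estimated from data
    # Column "A", "A", "-" on a three-leaf tree with made-up branch lengths.
    tree = (("A", 0.1), ((("A", 0.2), ("-", 0.2)), 0.05))
    print(column_likelihood(tree, q, root_freqs))
```

Because the sketched matrix is non-reversible, the likelihood depends on where the root is placed, so the tree is treated as rooted and a root distribution must be supplied explicitly. The cost per column remains linear in the number of branches, which is the efficiency property the peeling algorithm provides.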
