Models of sequence evolution for DNA sequences containing gaps.

Most evolutionary tree estimation methods for DNA sequences either ignore the phylogenetic information contained in shared patterns of gaps or use it inefficiently. This is largely due to the computational difficulty of implementing models for insertions and deletions. A simple way to incorporate this information is to treat a gap as a fifth character, alongside the four nucleotides, and to include it in a Markov model of nucleotide substitution. This idea has been dismissed in the past because it treats a multiple-site insertion or deletion as a sequence of independent events rather than as a single event. While this is true, we have found that under many circumstances it is better to incorporate gap information inadequately than to ignore it, at least for topology estimation. We propose an extension to a class of nucleotide substitution models that incorporates the gap character, and we show that, for data sets (both real and simulated) with short and medium gaps, these models do lead to effective use of the information contained in insertions and deletions. We also implement an ad hoc method in which the likelihood at columns containing multiple-site gaps is downweighted so that those columns do not have undue influence. The precision of the estimated tree, assessed using Markov chain Monte Carlo techniques to find the posterior distribution over tree space, improves under these five-state models compared with standard methods that effectively ignore gaps.
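
As an illustration of the fifth-character idea, the sketch below extends the simplest equal-rates (Jukes-Cantor style) substitution model from four states to five by adding the gap symbol. This is an assumption made for illustration only: the abstract does not say which class of substitution models is extended, and the names `five_state_jc_probs` and `pairwise_log_likelihood` are hypothetical. Time `t` is measured in expected changes per site, including changes to and from the gap state.

```python
import math

# Four nucleotides plus the gap treated as a fifth character.
STATES = "ACGT-"


def five_state_jc_probs(t):
    """Transition probabilities for an equal-rates model on five states.

    Illustrative assumption: every state changes to each other state at
    the same rate, scaled so that t is the expected number of changes
    per site.
    """
    k = len(STATES)
    decay = math.exp(-k * t / (k - 1))
    same = 1.0 / k + (k - 1) / k * decay   # probability of no net change
    diff = 1.0 / k - decay / k             # probability of ending in a given other state
    return [[same if i == j else diff for j in range(k)] for i in range(k)]


def pairwise_log_likelihood(seq_a, seq_b, t):
    """Log-likelihood of two aligned sequences separated by distance t > 0.

    Each column is treated as independent, so a gap spanning several
    sites is counted as several separate events.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    probs = five_state_jc_probs(t)
    index = {s: i for i, s in enumerate(STATES)}
    stationary = 1.0 / len(STATES)  # uniform equilibrium frequencies
    total = 0.0
    for a, b in zip(seq_a, seq_b):
        total += math.log(stationary * probs[index[a]][index[b]])
    return total


if __name__ == "__main__":
    # A shared gap column contributes like any other matching column.
    print(pairwise_log_likelihood("AC-GT", "AC-GT", 0.1))
    print(pairwise_log_likelihood("AC-GT", "ACAGT", 0.1))
```

Because columns are scored independently, a three-site deletion is penalised as three separate events in this sketch. That is the shortcoming the abstract acknowledges and the motivation for downweighting columns that belong to multiple-site gaps.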
