Phylogenetic inference under varying proportions of indel-induced alignment gaps

Background: The effect of alignment gaps on phylogenetic accuracy has been the subject of numerous studies. In this study, we investigated the relationship between the total number of gapped sites and phylogenetic accuracy when the gaps were introduced, by means of computer simulation, to reflect indel (insertion/deletion) events during the evolution of DNA sequences. The resulting (true) alignments were subjected to commonly used gap treatment and phylogenetic inference methods.

Results:
(1) In general, there was a strong, almost deterministic, relationship between the amount of gap in the data and the level of phylogenetic accuracy when the alignments were very "gappy".
(2) Gaps resulting from deletions (as opposed to insertions) contributed more to the inaccuracy of phylogenetic inference.
(3) The probabilistic methods (Bayesian, PhyML, and "MLε", a method implemented in DNAML in PHYLIP) performed better at most levels of gap percentage than the parsimony (MP) and distance (NJ) methods, with Bayesian analysis being clearly the best.
(4) Methods that treat gapped sites as missing data yielded less accurate trees than those that attribute phylogenetic signal to the gapped sites (by coding them as binary presence/absence character data, or as in the MLε method).
(5) In general, the accuracy of phylogenetic inference depended upon the amount of available data when the gaps resulted mainly from deletion events, and upon the amount of missing data when insertion events were equally likely to have caused the alignment gaps.

Conclusion: When gaps in an alignment are a consequence of indel events in the evolution of the sequences, the accuracy of phylogenetic analysis is likely to improve if: (1) alignment gaps are categorized as arising from insertion events or deletion events and then treated separately in the analysis, (2) the evolutionary signal provided by indels is harnessed in the phylogenetic analysis, and (3) methods that utilize the phylogenetic signal in indels are developed for distance methods too. When the true homology is known and gaps make up 20 percent of the alignment length or less, the methods used in this study are likely to yield trees with 90–100 percent accuracy.
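
As a rough illustration of two quantities the abstract relies on, the following Python sketch computes the fraction of gapped sites in an alignment and a per-site binary presence/absence coding of those sites. It is a minimal sketch under stated assumptions, not the procedure used in the study: the function names, the toy alignment, the use of "-" as the gap symbol, and the definition of a "gapped site" as any column containing at least one gap are all assumptions made here for illustration.

```python
from typing import Dict

GAP = "-"  # assumed gap symbol


def _columns(alignment: Dict[str, str]):
    """Return the alignment columns as tuples, after basic validation."""
    seqs = list(alignment.values())
    if not seqs or not seqs[0]:
        raise ValueError("alignment must contain at least one non-empty sequence")
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    return list(zip(*seqs))


def gapped_site_fraction(alignment: Dict[str, str]) -> float:
    """Fraction of columns with at least one gap (assumed definition)."""
    cols = _columns(alignment)
    return sum(1 for col in cols if GAP in col) / len(cols)


def binary_gap_coding(alignment: Dict[str, str]) -> Dict[str, str]:
    """Code each gapped column as one binary character per taxon.

    '1' means the taxon has a gap at that column, '0' means a nucleotide
    is present. Columns without any gap are skipped. This per-site coding
    is a simplification chosen for illustration.
    """
    gapped_cols = [col for col in _columns(alignment) if GAP in col]
    return {
        name: "".join("1" if col[i] == GAP else "0" for col in gapped_cols)
        for i, name in enumerate(alignment)
    }


# Hypothetical toy alignment: three taxa, ten aligned sites.
toy = {
    "A": "ACGT--ACGT",
    "B": "ACGTTTACGT",
    "C": "AC-TTTACGT",
}

fraction = gapped_site_fraction(toy)  # columns 3, 5 and 6 (1-based) are gapped
coding = binary_gap_coding(toy)
```

In this toy case three of the ten columns contain a gap, so the gapped-site fraction is 0.3. The binary strings could be appended to the nucleotide data as an extra character partition, whereas a missing-data treatment would simply leave the gap symbols in place.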
