Using ESTs for phylogenomics: Can one accurately infer a phylogenetic tree from a gappy alignment?

Background: While full genome sequences are still only available for a handful of taxa, large collections of partial gene sequences are available for many more. The alignment of partial gene sequences results in a multiple sequence alignment containing large gaps that are arranged in a staggered pattern. The consequences of this pattern of missing data on the accuracy of phylogenetic analysis are not well understood. We conducted a simulation study to determine the accuracy of phylogenetic trees obtained from gappy alignments using three commonly used phylogenetic reconstruction methods (Neighbor Joining, Maximum Parsimony, and Maximum Likelihood) and studied ways to improve the accuracy of trees obtained from such datasets.

Results: We found that the pattern of gappiness in multiple sequence alignments derived from partial gene sequences substantially compromised phylogenetic accuracy even in the absence of alignment error. The decline in accuracy was beyond what would be expected based on the amount of missing data alone. The decline was particularly dramatic for Neighbor Joining and Maximum Parsimony, where the majority of gappy alignments yielded trees containing 25% to 40% incorrect quartets. To improve the accuracy of the trees obtained from a gappy multiple sequence alignment, we examined two approaches. In the first approach, alignment masking, potentially problematic columns and input sequences are excluded from the dataset. Even in the absence of alignment error, masking improved phylogenetic accuracy up to 100-fold. However, masking retained, on average, only 83% of the input sequences. In the second approach, alignment subdivision, the missing data are statistically modelled in order to retain as many sequences as possible in the phylogenetic analysis. Subdivision resulted in more modest improvements to alignment accuracy, but succeeded in including almost all of the input sequences.

Conclusion: These results demonstrate that partial gene sequences and gappy multiple sequence alignments can pose a major problem for phylogenetic analysis. The concern will be greatest for high-throughput phylogenomic analyses, in which Neighbor Joining is often the preferred method due to its computational efficiency. Both masking and subdivision can be used to increase the accuracy of phylogenetic inference from a gappy alignment. The choice between the two will depend upon how robust the application is to the loss of sequences from the input set: alignment masking generally gives a much greater improvement in accuracy, but at the cost of discarding a larger number of the input sequences.
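The accuracy figures above are stated as a fraction of incorrect quartets. As a rough illustration of what such a measure counts, the following minimal sketch compares the topology that two trees induce on every set of four taxa. It is a brute-force illustration and not the implementation used in the study; the nested-tuple tree encoding and the function names (`leaf_sets`, `quartet_topology`, `incorrect_quartet_fraction`) are assumptions made for this example.

```python
from itertools import combinations


def leaf_sets(tree):
    """Return (leaves, clades) for a tree written as nested tuples of leaf names.

    `clades` holds the leaf set below every internal node; each one defines
    a split (clade versus all remaining leaves) of the unrooted tree.
    """
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, clades = frozenset(), []
    for child in tree:
        child_leaves, child_clades = leaf_sets(child)
        leaves |= child_leaves
        clades.extend(child_clades)
    clades.append(leaves)
    return leaves, clades


def quartet_topology(quartet, clades):
    """Return the pairing {{a, b}, {c, d}} induced on four leaves, or None.

    A split that puts exactly two of the four leaves on each side fixes the
    quartet topology; if no such split exists the quartet is unresolved.
    """
    for clade in clades:
        inside = [leaf for leaf in quartet if leaf in clade]
        if len(inside) == 2:
            outside = [leaf for leaf in quartet if leaf not in clade]
            return frozenset([frozenset(inside), frozenset(outside)])
    return None


def incorrect_quartet_fraction(true_tree, inferred_tree):
    """Fraction of four-leaf subsets whose topology differs between the trees."""
    leaves, true_clades = leaf_sets(true_tree)
    inferred_leaves, inferred_clades = leaf_sets(inferred_tree)
    if leaves != inferred_leaves:
        raise ValueError("both trees must contain the same taxa")
    quartets = list(combinations(sorted(leaves), 4))
    if not quartets:
        raise ValueError("at least four taxa are required")
    wrong = sum(
        quartet_topology(q, true_clades) != quartet_topology(q, inferred_clades)
        for q in quartets
    )
    return wrong / len(quartets)


if __name__ == "__main__":
    # Hypothetical five-taxon example: taxa B and C are swapped in the inferred tree.
    true_tree = (("A", "B"), "C", ("D", "E"))
    inferred_tree = (("A", "C"), "B", ("D", "E"))
    print(incorrect_quartet_fraction(true_tree, inferred_tree))
```

Enumerating every quartet is only practical for small trees, since the number of quartets grows with the fourth power of the number of taxa; larger comparisons would need a dedicated quartet-distance algorithm.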
