Performance, Accuracy, and Web Server for Evolutionary Placement of Short Sequence Reads under Maximum Likelihood

Abstract: We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pairwise sequence comparison, using edit distances and BLAST. We introduce two placement algorithms: a slow, accurate one and a fast, less accurate one. For the slow algorithm, we develop additional heuristics that yield almost the same run times as the fast version with only a small loss of accuracy. With those heuristics, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a large number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA.
