Aligning short reads to reference alignments and trees

MOTIVATION: Likelihood-based methods for placing short read sequences from metagenomic samples into reference phylogenies have recently been introduced. At present, it is unclear how best to align those reads to the reference alignment that was used to infer the reference phylogeny. Moreover, it has not been explored how well such alignment methods adapt to the strategy or philosophy underlying the reference alignment. Nor has it been assessed whether the reference phylogeny can be used together with the reference alignment to improve alignment accuracy in this context.

RESULTS: We assess different strategies for short read alignment and propose a novel phylogeny-aware alignment procedure. Our alignment method can improve the accuracy of subsequent phylogenetic placement of the reads into a reference phylogeny by up to 5.8 times compared with phylogeny-agnostic methods. It can be used to align reads to reference alignments generated with fundamentally different alignment strategies (e.g. PRANK(+F) versus MUSCLE).

AVAILABILITY: http://www.exelixis-lab.org/software.html
