Methods for automatic reference trees and multilevel phylogenetic placement

Abstract

Motivation: In most metagenomic sequencing studies, the initial analysis step consists of assessing the evolutionary provenance of the sequences. Phylogenetic (or evolutionary) placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. However, these placement methods face certain limitations: the manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; and the number of taxa in the reference phylogeny should be small enough to allow for visual inspection of the results.

Results: We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence datasets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results.

Availability and implementation: Freely available under GPLv3 at http://github.com/lczech/gappa.

Supplementary information: Supplementary data are available at Bioinformatics online.
