Rapid alignment-free phylogenetic identification of metagenomic sequences

Motivation: Taxonomic classification is at the core of environmental DNA analysis. When a phylogenetic tree can be built as a prior hypothesis to such classification, phylogenetic placement (PP) provides the most informative type of classification because each query sequence is assigned to its putative origin in the tree. This is useful whenever precision is sought (e.g. in diagnostics). However, likelihood-based PP algorithms struggle to scale with the ever-increasing throughput of DNA sequencing.

Results: We have developed RAPPAS (Rapid Alignment-free Phylogenetic Placement via Ancestral Sequences), which uses an alignment-free approach and so removes the hurdle of aligning query sequences as a preliminary step to PP. Our approach relies on the precomputation of a database of k-mers that may be present with non-negligible probability in relatives of the reference sequences. Placement is performed by inspecting the stored phylogenetic origins of the k-mers found in the query, together with their probabilities. The database can be reused to analyze several different metagenomes. Experiments show that the first implementation of RAPPAS is already faster than competing likelihood-based PP algorithms, while keeping similar accuracy for short reads. RAPPAS scales PP for the era of routine metagenomic diagnostics.

Availability: Program and sources are freely available for download at https://github.com/blinard-BIOINFO/RAPPAS.

Supplementary information: Supplementary data are available at Bioinformatics online.
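The placement idea described in the abstract (look up each query k-mer in a precomputed database, then combine the stored probabilities per phylogenetic origin) can be illustrated with a minimal sketch. This is not the RAPPAS implementation: the toy database, the branch names, the k-mer size, the fallback probability `THRESHOLD` for k-mers not stored for a branch, and the scoring rule (sum of log10 probabilities) are all assumptions made for illustration.

```python
import math

K = 4             # assumed k-mer size for this toy example
THRESHOLD = 1e-3  # assumed fallback probability for k-mers not stored for a branch

# Hypothetical precomputed database: k-mer -> {branch id: probability}.
TOY_DB = {
    "ACGT": {"branch_1": 0.9, "branch_2": 0.1},
    "CGTA": {"branch_1": 0.8},
    "GTAC": {"branch_2": 0.5},
}


def kmers(seq, k=K):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def place(query, db=TOY_DB, k=K, threshold=THRESHOLD):
    """Score each candidate branch and return (best_branch, scores).

    A branch is a candidate if at least one query k-mer is stored for it.
    Its score is the sum of log10 probabilities over all query k-mers,
    using `threshold` when a k-mer is not stored for that branch.
    """
    words = kmers(query, k)
    candidates = set()
    for w in words:
        candidates.update(db.get(w, {}))
    scores = {
        b: sum(math.log10(db.get(w, {}).get(b, threshold)) for w in words)
        for b in candidates
    }
    if not scores:
        return None, scores  # no query k-mer is present in the database
    return max(scores, key=scores.get), scores


best, scores = place("ACGTAC")
print(best, scores)
```

No alignment of the query is needed here: each query only requires k-mer lookups, and the same database can be queried for any number of reads, which reflects the reuse property the abstract mentions.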
