Rapid alignment-free phylogenetic identification of metagenomic sequences

Motivation: Taxonomic classification is at the core of environmental DNA analysis. When a phylogenetic tree can be built as a prior hypothesis for such classification, phylogenetic placement (PP) provides the most informative type of classification because each query sequence is assigned to its putative origin in the tree. This is useful whenever precision is sought (e.g. in diagnostics). However, likelihood-based PP algorithms struggle to scale with the ever-increasing throughput of DNA sequencing.

Results: We have developed RAPPAS (Rapid Alignment-free Phylogenetic Placement via Ancestral Sequences), which uses an alignment-free approach and thereby removes the hurdle of aligning query sequences as a preliminary step to PP. Our approach relies on the precomputation of a database of k-mers that are present with non-negligible probability in the relatives of the reference sequences. Placement is performed by inspecting the stored phylogenetic origins of the k-mers found in the query, together with their probabilities. The database can be reused for the analysis of several different metagenomes. Experiments show that the first implementation of RAPPAS is already faster than previous likelihood-based PP algorithms, while keeping similar accuracy for short reads. RAPPAS scales PP to the era of routine metagenomic diagnostics.

Availability: Program and sources are freely available for download at gite.lirmm.fr/linard/RAPPAS.

Contact: benjamin.linard@lirmm.fr
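The database-lookup idea described in the Results can be illustrated with a toy sketch. This is not the RAPPAS implementation: the database contents, the branch names, the k-mer size, the `THRESHOLD` fallback probability, and the log-sum scoring rule are all assumptions made for illustration. It only shows how a query could be scored against branches using stored k-mer probabilities, without aligning the query.

```python
import math
from collections import defaultdict

K = 4             # toy k-mer size (assumption)
THRESHOLD = 1e-3  # assumed fallback probability for k-mers not stored for a branch

# Hypothetical precomputed database: k-mer -> {branch id: probability}.
# In a real setting this would be built once from the reference tree and reused.
TOY_DB = {
    "ACGT": {"branch_A": 0.9, "branch_B": 0.05},
    "CGTA": {"branch_A": 0.8},
    "GTAC": {"branch_B": 0.7},
}


def kmers(seq, k=K):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def place(query, db, k=K, threshold=THRESHOLD):
    """Score each branch by summed log-probabilities of the query's k-mers.

    A k-mer that is not stored for a branch contributes log(threshold).
    Returns (best_branch, scores), or (None, {}) if no query k-mer is in db.
    """
    words = kmers(query, k)
    log_thr = math.log(threshold)
    # For each branch, accumulate the gain over the fallback for k-mers found.
    gain = defaultdict(float)
    for word in words:
        for branch, prob in db.get(word, {}).items():
            gain[branch] += math.log(prob) - log_thr
    if not gain:
        return None, {}
    base = len(words) * log_thr  # score of a branch matching no k-mer at all
    scores = {branch: base + g for branch, g in gain.items()}
    best = max(scores, key=scores.get)
    return best, scores


best_branch, branch_scores = place("ACGTA", TOY_DB)
print(best_branch, branch_scores)
```

Only branches that share at least one k-mer with the query are scored, so the work per query depends on the number of database hits rather than on the size of the tree.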
