A review of methods and databases for metagenomic classification and assembly

Microbiome research has grown rapidly over the past decade, with a proliferation of new methods that seek to make sense of large, complex data sets. Here, we survey two of the primary types of methods for analyzing microbiome data, read classification and metagenomic assembly, and review some of the challenges these methods face. All of these methods rely on public genome databases, and we also discuss the content of these databases and how their quality directly affects our ability to interpret a microbiome sample.
