Scalable metagenomic taxonomy classification using a reference genome database

Motivation: Deep metagenomic sequencing of biological samples has the potential to recover otherwise difficult-to-detect microorganisms and to accurately characterize biological samples with limited prior knowledge of sample contents. Existing metagenomic taxonomic classification algorithms, however, do not scale well to large metagenomic datasets, and balancing classification accuracy with computational efficiency presents a fundamental challenge.

Results: A method is presented that shifts computational costs to an off-line computation by creating a taxonomy/genome index that supports scalable metagenomic classification. Scalable performance is demonstrated on real and simulated data, showing accurate classification in the presence of novel organisms on samples that include viruses, prokaryotes, fungi and protists. Taxonomic classification of the previously published 150-gigabase Tyrolean Iceman dataset took <20 h on a single-node, 40-core, large-memory machine and provided new insights into the metagenomic contents of the sample.

Availability: Software was implemented in C++ and is freely available at http://sourceforge.net/projects/lmat

Contact: allen99@llnl.gov

Supplementary information: Supplementary data are available at Bioinformatics online.
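The idea of moving cost into an off-line index can be illustrated with a minimal sketch. This is not the authors' implementation (which is in C++ and is not described in detail here); it is a toy example that assumes a k-mer based index, a simple majority vote, and two made-up reference sequences. The names `build_index`, `classify`, `GENOMES`, `taxon_A` and `taxon_B` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict

K = 4  # toy k-mer length (assumption); practical tools use much longer k-mers


def kmers(seq, k=K):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k=K):
    """Off-line step: map each k-mer to the set of taxa whose genome contains it."""
    index = defaultdict(set)
    for taxon, genome in genomes.items():
        for kmer in kmers(genome, k):
            index[kmer].add(taxon)
    return dict(index)


def classify(read, index, k=K):
    """On-line step: look up each k-mer of the read and vote per taxon.

    Returns the taxon with the most k-mer hits, or None when no k-mer
    matches or when the two best taxa are tied.
    """
    votes = Counter()
    for kmer in kmers(read, k):
        for taxon in index.get(kmer, ()):
            votes[taxon] += 1
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous read in this toy scheme
    return ranked[0][0]


# Hypothetical toy reference genomes, not real sequences.
GENOMES = {
    "taxon_A": "ACGTACGTTTGACCA",
    "taxon_B": "GGGCCCAAATTTGGG",
}

# Built once, ahead of time; classification only performs lookups.
INDEX = build_index(GENOMES)

if __name__ == "__main__":
    for read in ("ACGTACGT", "GGGCCCAA", "TTTG"):
        print(read, classify(read, INDEX))
```

The index is built once from the reference genomes, so the per-read work reduces to dictionary lookups and a count. In this sketch an ambiguous read is simply left unclassified; a taxonomy-aware classifier could instead report a shared ancestor of the tied taxa.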
