Analysis Methods for Shotgun Metagenomics

Whole metagenome shotgun sequencing (WGS) has made it possible to characterize the taxonomic diversity and functional capabilities of microbial communities in situ, without first isolating and cultivating the organisms. A WGS experiment run on second- or third-generation sequencing technologies generates millions of reads and tens or even hundreds of gigabytes of data about the organisms under investigation. Although they carry an immense amount of information, the reads are unorganized and unlabeled, which makes it difficult to determine which genome each read came from. Analysis of WGS data therefore begins by inferring community structure and function from the raw reads; only then can the focus shift to multi-sample comparisons. A typical WGS workflow consists of read assignment (taxonomic binning and classification), preprocessing (normalization, dimensionality reduction), exploratory analysis (feature selection and extraction, ordination), statistical inference (regression, constrained ordination, differential abundance analysis), and machine learning. This chapter provides an overview of these analytical approaches, the challenges and pitfalls researchers may encounter, and steps toward their solutions. Relevant software packages and resources are also discussed.
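
As a rough illustration of the preprocessing and exploratory steps named above, the following minimal sketch normalizes a small table of read counts with a centered log-ratio (CLR) transform and then ordinates the samples by principal component analysis. The count table, the function names, and the pseudocount of 1 are assumptions made for this example only; they are not taken from the chapter, and the chapter may recommend different normalization or ordination choices.

```python
import numpy as np

def clr_transform(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples x taxa count matrix.

    The pseudocount (an assumption of this sketch) avoids log(0) for
    taxa that were not observed in a sample.
    """
    logged = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtract each sample's mean log value so every row sums to zero.
    return logged - logged.mean(axis=1, keepdims=True)

def pca_ordination(features, n_components=2):
    """Project samples onto the leading principal components via SVD."""
    centered = features - features.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # Sample scores are the left singular vectors scaled by singular values.
    return u[:, :n_components] * s[:n_components]

# Hypothetical toy table: 4 samples (rows) x 3 taxa (columns) of read counts.
counts = np.array([
    [120,  30,  5],
    [100,  40,  8],
    [ 10, 200, 50],
    [ 12, 180, 60],
])

clr = clr_transform(counts)
coords = pca_ordination(clr)
print(coords.shape)
```

In this toy table the first two samples are dominated by the first taxon and the last two by the second, so they would be expected to fall on opposite sides of the first ordination axis. Real data would come from the read-assignment step and would typically need additional handling of sparsity and sequencing depth.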
