Computational and Statistical Considerations in the Analysis of Metagenomic Data

In shotgun metagenomics, microbial communities are studied by sequencing random DNA fragments directly from environmental and clinical samples. The resulting data are massive, potentially consisting of billions of sequence reads describing millions of microbial genes. Interpreting the data is therefore nontrivial and depends on dedicated computational and statistical methods. In this chapter we discuss the many challenges associated with the analysis of shotgun metagenomic data. First, we address computational issues related to the quantification of genes in metagenomes. We describe algorithms for efficient sequence comparison, recommended practices for setting up data workflows, and modern high-performance computing resources that can be used to perform the analysis. Next, we outline the statistical aspects, including the removal of systematic errors and the identification of differences between microbial communities from different experimental conditions. We conclude by underlining the increasing importance of efficient and reliable computational and statistical solutions in the analysis of large metagenomic datasets.
