A Statistical Perspective on the Challenges in Molecular Microbial Biology

High-throughput sequencing (HTS) technology makes it possible to identify and quantify non-culturable microbial organisms in all environments. Microbial sequences have enhanced our understanding of the human microbiome, the soil and plant environment, and the marine environment. All molecular microbial data pose statistical challenges due to contaminant sequences from reagents, batch effects, unequal sampling, and undetected taxa. Technical biases and heteroscedasticity have the strongest effects, but differences in strains across subjects and environments also make direct differential abundance testing unwieldy. We introduce a few statistical tools that can overcome some of these difficulties and demonstrate them on an example. We show how standard statistical methods, such as simple hierarchical mixture and topic models, can facilitate inferences on latent microbial communities. We also review some nonparametric Bayesian approaches that combine visualization and uncertainty quantification. The intersection of molecular microbial biology and statistics is an exciting new venue. Finally, we list some of the important open problems that would benefit from more careful statistical method development.

[98]  Susan P. Holmes,et al.  Multidomain analyses of a longitudinal human microbiome intestinal cleanout perturbation experiment , 2017, PLoS Comput. Biol..