Modelling of zero-inflation improves inference of metagenomic gene count data

Metagenomics enables the study of gene abundances in complex mixtures of microorganisms and has become a standard methodology for the analysis of the human microbiome. However, gene abundance data are inherently noisy, with high levels of biological and technical variability as well as an excess of zeros due to non-detected genes, which makes statistical analysis challenging. In this study, we present a new hierarchical Bayesian model for inference of metagenomic gene abundance data. The model uses a zero-inflated overdispersed Poisson distribution, which simultaneously captures the high gene-specific variability and the zero observations in the data. By analysing three comprehensive datasets, we show that zero-inflation is common in metagenomic data from the human gut and that, if not correctly modelled, it can lead to substantial reductions in statistical power. Using resampled metagenomic data, we also show that our model detects differentially abundant genes with higher and more stable performance than other methods. We conclude that proper modelling of the gene-specific variability, including the excess of zeros, is necessary to accurately describe gene abundances in metagenomic data. The proposed model will thus pave the way for new biological insights into the structure of microbial communities.
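
To make the idea of a zero-inflated overdispersed Poisson distribution concrete, the following is a minimal Python sketch, not the hierarchical Bayesian model of the study. It assumes that overdispersion arises from a gamma-distributed Poisson rate (the abstract does not state the mixing distribution), and the names `zip_pmf`, `sample_poisson`, `sample_count`, `mean`, `dispersion` and `pi` are hypothetical, chosen only for illustration.

```python
import math
import random


def zip_pmf(k: int, lam: float, pi: float) -> float:
    """P(Y = k) for a zero-inflated Poisson with rate lam > 0.

    With probability pi the count is a structural zero (non-detected gene);
    otherwise it is drawn from Poisson(lam).
    """
    poisson = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    return pi * (k == 0) + (1.0 - pi) * poisson


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate with Knuth's method (fine for moderate lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= threshold:
            return k - 1


def sample_count(mean: float, dispersion: float, pi: float, rng: random.Random) -> int:
    """Draw one zero-inflated, overdispersed count.

    Assumption: the Poisson rate is gamma-distributed with the given mean and
    variance mean**2 * dispersion, which is one common way to add overdispersion.
    """
    if rng.random() < pi:
        return 0  # structural zero
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)  # shape, scale
    return sample_poisson(rate, rng)


rng = random.Random(1)
counts = [sample_count(mean=5.0, dispersion=0.5, pi=0.3, rng=rng) for _ in range(5000)]
zero_fraction = sum(c == 0 for c in counts) / len(counts)
poisson_zero_fraction = math.exp(-5.0)  # zeros expected from a plain Poisson(5)
print(f"observed zeros: {zero_fraction:.3f}, plain Poisson: {poisson_zero_fraction:.3f}")
```

In this sketch the simulated zero fraction is far above what a plain Poisson with the same mean would predict, which is the kind of excess that a model without a zero-inflation component has to absorb as extra variance.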
