Exploring thematic structure and predicted functionality of 16S rRNA amplicon data

Analysis of microbiome data often involves identifying co-occurring groups of taxa associated with sample features of interest (e.g., disease state). Elucidating such associations is difficult because microbiome data are compositional, sparse, and high-dimensional. Moreover, the configuration of co-occurring taxa may represent overlapping subcommunities that contribute to sample characteristics such as host status. Preserving the configuration of co-occurring microbes, rather than detecting specific indicator species, is more likely to facilitate biologically meaningful interpretations. In addition, analyses that use taxonomic relative abundances to predict the abundances of gene functions aggregate the predicted functional profiles across taxa, which precludes straightforward identification of predicted functional components associated with subsets of co-occurring taxa. We provide an approach that explores co-occurring taxa using "topics" generated by a topic model and links these topics to specific sample features (e.g., disease state). Rather than inferring predicted functional content from overall taxonomic relative abundances, we focus on inference of functional content within topics, which we parse by estimating interactions between topics and pathways through a multilevel, fully Bayesian regression model. We apply our methods to three publicly available 16S amplicon sequencing datasets: an inflammatory bowel disease dataset, an oral cancer dataset, and a time-series dataset. Using our topic model approach to uncover latent structure in 16S rRNA amplicon surveys, investigators can (1) capture groups of co-occurring taxa, termed topics; (2) uncover within-topic functional potential; (3) link taxa co-occurrence, gene function, and environmental/host features; and (4) explore how sets of co-occurring taxa behave and evolve over time.
These methods have been implemented in a freely available R package: https://cran.r-project.org/package=themetagenomics, https://github.com/EESI/themetagenomics.
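To make the idea of "topics" as groups of co-occurring taxa concrete, the following is a minimal Python sketch, not the package's implementation and not its API (the package itself is written in R). It assumes plain latent Dirichlet allocation fitted by collapsed Gibbs sampling as a simplified stand-in for the topic model described above, and it compares mean topic proportions between groups instead of fitting a regression on sample features. The count table, the `disease` labels, the function name `fit_lda`, and all hyperparameter values are hypothetical and chosen only for illustration.

```python
import random


def fit_lda(counts, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler for LDA on a samples x taxa count table.

    Returns (theta, phi): per-sample topic proportions and
    per-topic taxon proportions, both as lists of lists.
    """
    rng = random.Random(seed)
    n_samples, n_taxa = len(counts), len(counts[0])
    # Expand the table into one token per read: (sample index, taxon index).
    tokens = [(s, t) for s in range(n_samples)
              for t in range(n_taxa) for _ in range(counts[s][t])]
    z = [rng.randrange(n_topics) for _ in tokens]  # random initial topics
    sample_topic = [[0] * n_topics for _ in range(n_samples)]
    topic_taxon = [[0] * n_taxa for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for (s, t), k in zip(tokens, z):
        sample_topic[s][k] += 1
        topic_taxon[k][t] += 1
        topic_total[k] += 1
    for _ in range(n_iter):
        for i, (s, t) in enumerate(tokens):
            # Remove the token's current assignment from the counts.
            k = z[i]
            sample_topic[s][k] -= 1
            topic_taxon[k][t] -= 1
            topic_total[k] -= 1
            # Resample its topic from the collapsed conditional distribution.
            weights = [(sample_topic[s][j] + alpha)
                       * (topic_taxon[j][t] + beta)
                       / (topic_total[j] + n_taxa * beta)
                       for j in range(n_topics)]
            k = rng.choices(range(n_topics), weights=weights)[0]
            z[i] = k
            sample_topic[s][k] += 1
            topic_taxon[k][t] += 1
            topic_total[k] += 1
    theta = [[(sample_topic[s][k] + alpha)
              / (sum(sample_topic[s]) + n_topics * alpha)
              for k in range(n_topics)] for s in range(n_samples)]
    phi = [[(topic_taxon[k][t] + beta) / (topic_total[k] + n_taxa * beta)
            for t in range(n_taxa)] for k in range(n_topics)]
    return theta, phi


# Hypothetical toy table: rows are samples, columns are taxa (read counts).
counts = [
    [20, 15, 10, 1, 0, 0],
    [18, 12, 14, 0, 1, 0],
    [22, 10, 12, 0, 0, 1],
    [1, 0, 0, 19, 14, 12],
    [0, 1, 0, 21, 11, 13],
    [0, 0, 1, 17, 16, 10],
]
disease = [0, 0, 0, 1, 1, 1]  # hypothetical sample feature (0 = control)

theta, phi = fit_lda(counts, n_topics=2)


def group_mean(k, label):
    """Mean proportion of topic k over samples with the given label."""
    rows = [theta[s][k] for s in range(len(theta)) if disease[s] == label]
    return sum(rows) / len(rows)


# Crude link between topics and the sample feature: difference in means.
diff = [group_mean(k, 1) - group_mean(k, 0) for k in range(2)]
for k, d in enumerate(diff):
    print(f"topic {k}: disease minus control mean proportion = {d:+.3f}")
```

Each row of `phi` is a distribution over taxa, so the highest-weight taxa in a row can be read as one co-occurring group, and each row of `theta` gives the mixture of those groups in a sample. Topic labels are arbitrary, so which topic index tracks the disease group can differ between runs with different seeds.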

[20]  Hong Gu,et al.  BioMiCo: a supervised Bayesian model for inference of microbial community structure , 2015, Microbiome.

[21]  Robert C. Edgar,et al.  SINAPS: Prediction of microbial traits from marker gene sequences , 2017, bioRxiv.

[22]  James T. Morton,et al.  Microbiome-wide association studies link dynamic microbial consortia to disease , 2016, Nature.

[23]  Christian L. Müller,et al.  Sparse and Compositionally Robust Inference of Microbial Ecological Networks , 2014, PLoS Comput. Biol..

[24]  William A. Walters,et al.  QIIME allows analysis of high-throughput community sequencing data , 2010, Nature Methods.

[25]  Susumu Goto,et al.  KEGG: Kyoto Encyclopedia of Genes and Genomes , 2000, Nucleic Acids Res..

[26]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[27]  Margaret E. Roberts,et al.  stm: An R Package for Structural Topic Models , 2019, Journal of Statistical Software.

[28]  Michael I. Love,et al.  Differential analysis of count data – the DESeq2 package , 2013 .

[29]  Jens Timmer,et al.  Summary of the DREAM8 Parameter Estimation Challenge: Toward Parameter Identification for Whole-Cell Models , 2015, PLoS Comput. Biol..

[30]  H. Zou,et al.  Regularization and variable selection via the elastic net , 2005 .

[31]  C. Quince,et al.  Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics , 2012, PloS one.

[32]  Peter Meinicke,et al.  Tax4Fun: predicting functional profiles from metagenomic 16S rRNA data , 2015, Bioinform..

[33]  Jesse R. Zaneveld,et al.  Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences , 2013, Nature Biotechnology.

[34]  S. Holmes,et al.  Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses , 2016, F1000Research.

[35]  Trevor Hastie,et al.  Regularization Paths for Generalized Linear Models via Coordinate Descent. , 2010, Journal of statistical software.

[36]  L. Trippa,et al.  Bayesian Nonparametric Ordination for the Analysis of Microbial Communities , 2016, Journal of the American Statistical Association.

[37]  R. Knight,et al.  Supervised classification of human microbiota. , 2011, FEMS microbiology reviews.

[38]  Jonathan A. Eisen,et al.  Incorporating 16S Gene Copy Number Information Improves Estimates of Microbial Diversity and Abundance , 2012, PLoS Comput. Biol..

[39]  Eric J. Alm,et al.  Erratum to: Host lifestyle affects human microbiota on daily timescales , 2016, Genome Biology.

[40]  Brian L. Schmidt,et al.  Piphillin: Improved Prediction of Metagenomic Content by Direct Inference from Human Microbiomes , 2016, PloS one.

[41]  David G. Rand,et al.  Structural Topic Models for Open‐Ended Survey Responses , 2014, American Journal of Political Science.

[42]  Rob Knight,et al.  Bayesian community-wide culture-independent microbial source tracking , 2011, Nature Methods.

[43]  Xin Chen,et al.  Identifying enterotype in human microbiome by decomposing probabilistic topics into components , 2012, 2012 IEEE International Conference on Bioinformatics and Biomedicine.