Shrinkage improves estimation of microbial associations under different normalization methods

Consistent estimation of associations in microbial genomic survey count data is fundamental to microbiome research. Technical limitations, including compositionality, low sample sizes, and technical variability, obstruct the standard application of association measures and require data normalization prior to estimating associations. Here, we investigate the interplay between data normalization and microbial association estimation through a comprehensive analysis of statistical consistency. Leveraging the large sample size of the American Gut Project (AGP), we assess the consistency of two prominent linear association estimators, correlation and proportionality, under different sample scenarios and data normalization schemes, including RNA-seq analysis workflows and log-ratio transformations. We show that shrinkage estimation, a standard technique in high-dimensional statistics, can universally improve the quality of association estimates for microbiome data. We find that large-scale association patterns in the AGP data can be grouped into five normalization-dependent classes. Using microbial association network construction and clustering as examples of exploratory data analysis, we show that variance-stabilizing and log-ratio approaches provide the most consistent estimation of taxonomic and structural coherence. Taken together, the findings from our reproducible analysis workflow have important implications for microbiome studies at multiple stages of analysis, particularly when only small sample sizes are available.
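As a rough illustration of the two ingredients named above, a log-ratio transformation followed by shrinkage of the correlation estimate, the following minimal Python sketch may help. It is not the authors' workflow: the function names `clr` and `shrink_correlation`, the pseudocount of 1, the identity shrinkage target, and the fixed intensity `lam` are all assumptions made for illustration. In practice the shrinkage intensity would typically be chosen from the data rather than fixed by hand.

```python
import numpy as np


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform of a samples-by-taxa count matrix.

    The pseudocount (an assumption here) avoids taking log(0).
    """
    logx = np.log(np.asarray(counts, dtype=float) + pseudocount)
    # Subtract each sample's mean log-abundance so every row sums to zero.
    return logx - logx.mean(axis=1, keepdims=True)


def shrink_correlation(data, lam):
    """Linearly shrink the sample correlation matrix toward the identity.

    lam = 0 returns the plain sample correlation; lam = 1 returns the identity.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    r = np.corrcoef(data, rowvar=False)  # columns are taxa
    return (1.0 - lam) * r + lam * np.eye(r.shape[0])


# Hypothetical toy data: 15 samples, 6 taxa (few samples relative to taxa).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=20.0, size=(15, 6))

z = clr(counts)
r_shrunk = shrink_correlation(z, lam=0.3)  # intensity picked by hand
```

Shrinking toward the identity pulls every off-diagonal correlation toward zero by the same factor while leaving the diagonal at one, which trades a small bias for reduced variance when samples are scarce.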
