Large Covariance Estimation for Compositional Data Via Composition-Adjusted Thresholding

ABSTRACT

High-dimensional compositional data arise naturally in many applications such as metagenomic data analysis. The observed data lie in a high-dimensional simplex, and conventional statistical methods often fail to produce sensible results due to the unit-sum constraint. In this article, we address the problem of covariance estimation for high-dimensional compositional data and introduce a composition-adjusted thresholding (COAT) method under the assumption that the basis covariance matrix is sparse. Our method is based on a decomposition relating the compositional covariance to the basis covariance, which is approximately identifiable as the dimensionality tends to infinity. The resulting procedure can be viewed as thresholding the sample centered log-ratio covariance matrix and hence is scalable for large covariance matrices. We rigorously characterize the identifiability of the covariance parameters, derive rates of convergence under the spectral norm, and provide theoretical guarantees on support recovery. Simulation studies demonstrate that the COAT estimator outperforms some existing optimization-based estimators. We apply the proposed method to the analysis of a microbiome dataset to understand the dependence structure among bacterial taxa in the human gut.
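The idea of thresholding the sample centered log-ratio covariance matrix can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `clr` and `soft_threshold_covariance`, the single universal threshold `lam`, the choice of soft thresholding, and the simulated log-normal basis are all assumptions made here for illustration. The abstract does not specify the thresholding rule or how the threshold is chosen.

```python
import numpy as np


def clr(compositions):
    """Centered log-ratio transform; each row must be a strictly positive composition."""
    logs = np.log(compositions)
    # Subtracting the row mean of the logs removes the unknown total abundance.
    return logs - logs.mean(axis=1, keepdims=True)


def soft_threshold_covariance(compositions, lam):
    """Soft-threshold the sample covariance of clr-transformed data.

    `lam` is a hypothetical universal threshold; choosing it (for example by
    cross-validation) is outside the scope of this sketch.
    """
    z = clr(np.asarray(compositions, dtype=float))
    cov = np.cov(z, rowvar=False)  # p x p sample clr covariance
    shrunk = np.sign(cov) * np.maximum(np.abs(cov) - lam, 0.0)
    # Assumption: variances on the diagonal are left unthresholded.
    np.fill_diagonal(shrunk, np.diag(cov))
    return shrunk


# Simulated example: a log-normal basis normalized to the simplex.
rng = np.random.default_rng(0)
basis = np.exp(rng.normal(size=(50, 8)))
x = basis / basis.sum(axis=1, keepdims=True)
est = soft_threshold_covariance(x, lam=0.1)
```

Because the estimate needs only one sample covariance and an entrywise operation, its cost grows quadratically in the number of components, which is the sense in which a thresholding procedure scales to large matrices.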
