Regression Analysis for Microbiome Compositional Data

An important problem in microbiome analysis is identifying the bacterial taxa that are associated with a response, where the microbiome data are summarized as the composition of the bacterial taxa at different taxonomic levels. This paper considers regression analysis with such compositional data as covariates. To ensure that the results are subcompositionally coherent, linear models with a set of linear constraints on the regression coefficients are introduced. Such models allow regression analysis for subcompositions and include the log-contrast model for compositional covariates as a special case. A penalized estimation procedure is developed for estimating the regression coefficients and selecting variables under the linear constraints. A method is also proposed for obtaining de-biased estimates of the regression coefficients that are asymptotically unbiased and have a joint asymptotic multivariate normal distribution. This yields valid confidence intervals for the regression coefficients and can be used to obtain $p$-values. Simulation results show that the confidence intervals are valid and that the de-biased estimates have smaller variances when the linear constraints are imposed. The proposed methods are applied to a gut microbiome data set and identify four bacterial genera that are associated with body mass index after adjusting for total fat and caloric intake.
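To make the role of the linear constraints concrete, the following is a minimal sketch of the special case named above: a log-contrast model in which the coefficients of the log-transformed proportions are constrained to sum to zero. It is not the penalized or de-biased procedure of the paper; it fits ordinary least squares under the constraint by solving the KKT system. The simulated data, the helper name `constrained_least_squares`, and the constraint matrix `C` are assumptions introduced only for illustration.

```python
import numpy as np


def constrained_least_squares(Z, y, C):
    """Minimize ||y - Z b||^2 subject to C.T @ b = 0 (hypothetical helper).

    Solves the KKT system  [Z'Z  C] [b ]   [Z'y]
                           [C'   0] [mu] = [ 0 ].
    """
    p = Z.shape[1]
    r = C.shape[1]
    kkt = np.block([[Z.T @ Z, C], [C.T, np.zeros((r, r))]])
    rhs = np.concatenate([Z.T @ y, np.zeros(r)])
    solution = np.linalg.solve(kkt, rhs)
    return solution[:p]  # drop the Lagrange multipliers


# Simulated data (an assumption, not the paper's data set).
rng = np.random.default_rng(0)
n, p = 200, 5
counts = rng.gamma(shape=2.0, scale=1.0, size=(n, p))
X = counts / counts.sum(axis=1, keepdims=True)  # each row is a composition
Z = np.log(X)

beta_true = np.array([1.0, -1.0, 0.5, -0.5, 0.0])  # sums to zero
y = Z @ beta_true + 0.1 * rng.standard_normal(n)

# A single constraint: the coefficients sum to zero (log-contrast model).
C = np.ones((p, 1))
beta_hat = constrained_least_squares(Z, y, C)

# Because the coefficients sum to zero, rescaling each sample leaves the
# fitted values unchanged: raw counts and proportions give the same fit.
fitted_from_proportions = Z @ beta_hat
fitted_from_counts = np.log(counts) @ beta_hat
print("sum of coefficients:", beta_hat.sum())
print("max fitted difference:",
      np.max(np.abs(fitted_from_proportions - fitted_from_counts)))
```

Adding further columns to `C` would impose additional linear constraints, for example one sum-to-zero constraint per taxonomic group, which is one way to read the subcomposition setting described in the abstract.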
