High-dimensional Log-Error-in-Variable Regression with Applications to Microbial Compositional Data Analysis

In microbiome and genomic studies, regression on compositional data is a crucial tool for identifying microbial taxa or genes that are associated with clinical phenotypes. To account for variation in sequencing depth, the classic log-contrast model is often used, with read counts normalized into compositions. However, zero read counts and the randomness in covariates remain critical issues. In this article, we introduce a surprisingly simple, interpretable, and efficient method for estimation in compositional data regression through the lens of a novel high-dimensional log-error-in-variable regression model. The proposed method corrects for sequencing data with possible overdispersion and, at the same time, avoids any subjective imputation of zero read counts. We provide theoretical justification with matching upper and lower bounds for the estimation error. We also consider a general log-error-in-variable regression model, with a corresponding estimation method, to accommodate broader situations. The merit of the procedure is illustrated through real data analysis and simulation studies.
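
As a rough illustration of the background only, and not of the proposed method, the sketch below assumes the standard log-contrast form: a linear predictor in the log-proportions with coefficients constrained to sum to zero. The function names and the toy counts are hypothetical. It shows why normalizing to compositions removes the dependence on sequencing depth, and why a zero read count is a problem for this model.

```python
import math

def to_composition(counts):
    """Normalize raw read counts so that they sum to one."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]

def log_contrast_predictor(composition, beta):
    """Linear predictor sum_j beta_j * log(x_j), assuming sum_j beta_j = 0."""
    if abs(sum(beta)) > 1e-9:
        raise ValueError("log-contrast coefficients must sum to zero")
    if any(x == 0 for x in composition):
        # log(0) is undefined, which is why zeros are usually imputed
        raise ValueError("zero proportion: log is undefined")
    return sum(b * math.log(x) for b, x in zip(beta, composition))

beta = [1.0, -1.0, 0.0]        # hypothetical zero-sum coefficients
shallow = [10, 20, 70]         # same sample, low sequencing depth
deep = [100, 200, 700]         # same sample, ten times the depth

pred_shallow = log_contrast_predictor(to_composition(shallow), beta)
pred_deep = log_contrast_predictor(to_composition(deep), beta)
print(pred_shallow, pred_deep)  # the two values agree

try:
    log_contrast_predictor(to_composition([0, 30, 70]), beta)
except ValueError as err:
    print("zero count:", err)
```

With these toy inputs the predictor reduces to log(0.1) - log(0.2), regardless of depth, while the sample containing a zero count cannot be evaluated without some replacement for the zero.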
