Robust regression with compositional covariates

Many high-throughput sequencing data sets in biology are compositional in nature. A prominent example is microbiome profiling data, including targeted amplicon-based and metagenomic sequencing data. These profiling data comprise surveys of microbial communities in their natural habitats and consist of sparse proportional (or compositional) read counts that represent operational taxonomic units or genes. When paired measurements of other covariates are available, such as physicochemical properties of the habitat or phenotypic variables of the host, inferring parsimonious and robust statistical relationships between the microbial abundance data and the covariate measurements is often an important first step in exploratory data analysis. To this end, we propose a sparse robust statistical regression framework that takes compositional and non-compositional measurements as predictors and identifies outliers in continuous response variables. Our model extends the seminal log-contrast model of Aitchison and Bacon-Shone (1984) with a mean-shift formulation for capturing outliers, sparsity-promoting convex and non-convex penalties for parsimonious model selection, and data-driven robust initialization procedures adapted to the compositional setting. We show, in theory and in simulations, that our approach can jointly select a sparse set of predictive microbial features and identify outliers in the response. We illustrate the viability of our method by robustly predicting human body mass indices from American Gut Project amplicon data and non-compositional covariate data. We believe that the robust estimators introduced here, available in the R package RobRegCC, can serve as a practical tool for reliable statistical regression analysis of compositional data, including microbiome survey data.
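To make the mean-shift idea concrete, the following is a minimal Python sketch, not the RobRegCC implementation. It assumes the model y = intercept + log(X) beta + gamma + noise with the zero-sum constraint sum(beta) = 0 of the log-contrast model, and it places an l1 penalty on the case-specific shifts gamma only. The sparsity penalty on beta, the non-convex penalties, and the robust initialization described above are deliberately omitted. All function names, the simulated data, and the threshold `lam` are hypothetical choices for illustration.

```python
import numpy as np


def soft_threshold(r, lam):
    """Elementwise soft-thresholding, the proximal map of lam * |.|."""
    return np.sign(r) * np.maximum(np.abs(r) - lam, 0.0)


def fit_mean_shift_log_contrast(X, y, lam=1.0, n_iter=200):
    """Alternate least squares for (intercept, beta) with thresholding for gamma.

    X holds strictly positive compositions (rows sum to one).
    Row-centering log(X) makes the minimum-norm least-squares solution
    satisfy sum(beta) = 0, which is the log-contrast constraint.
    """
    Z = np.log(X)
    Zc = Z - Z.mean(axis=1, keepdims=True)       # centered log-ratio design
    A = np.column_stack([np.ones(len(y)), Zc])   # intercept + compositional part
    gamma = np.zeros(len(y))                     # mean-shift (outlier) parameters
    for _ in range(n_iter):
        # Fit the regression on the shift-corrected response.
        # A is rank deficient by construction, so lstsq returns the
        # minimum-norm solution; rcond is set explicitly for stability.
        theta, *_ = np.linalg.lstsq(A, y - gamma, rcond=1e-10)
        # Residuals larger than lam are absorbed into gamma.
        gamma = soft_threshold(y - A @ theta, lam)
    return theta[0], theta[1:], gamma


def simulate(n=100, n_out=5, seed=0):
    """Toy data: five-part compositions with n_out shifted responses."""
    rng = np.random.default_rng(seed)
    W = rng.lognormal(size=(n, 5))
    X = W / W.sum(axis=1, keepdims=True)              # close to the simplex
    beta_true = np.array([1.0, -1.0, 0.5, -0.5, 0.0])  # sums to zero
    y = np.log(X) @ beta_true + 0.1 * rng.standard_normal(n)
    y[:n_out] += 10.0                                 # gross outliers in the response
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    intercept, beta, gamma = fit_mean_shift_log_contrast(X, y)
    print("sum of beta:", beta.sum())
    print("flagged observations:", np.flatnonzero(gamma))
```

Observations with a nonzero entry in `gamma` are the ones flagged as outliers. With an l1 penalty on `gamma`, this alternating scheme corresponds to a Huber-type fit; the non-convex penalties mentioned in the abstract would replace the soft-thresholding step with a different thresholding rule.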
