A Scalable Empirical Bayes Approach to Variable Selection

Abstract: A new empirical Bayes approach to variable selection in the context of generalized linear models is developed. The proposed algorithm scales to situations in which the number of putative explanatory variables is very large, possibly much larger than the number of responses. The coefficients in the linear predictor are modeled as a three-component mixture, allowing each explanatory variable to have a random positive effect on the response, a random negative effect, or no effect. A key assumption is that only a small (but unknown) fraction of the candidate variables have a nonzero effect. This assumption, together with treating the coefficients as random effects, leads to a computationally efficient approach: the number of parameters that must be estimated is small and remains constant regardless of the number of explanatory variables. The model parameters are estimated using a generalized alternating maximization algorithm that is scalable and converges significantly faster than simulation-based fully Bayesian methods. Supplementary materials for this article are available online.

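The following minimal sketch (Python with NumPy, not taken from the article or its supplementary materials) illustrates the kind of data-generating model the abstract describes: coefficients drawn from a three-component mixture with small fractions of positive and negative effects, passed through a GLM linear predictor. The specific choices here (the mixing proportions `p_pos` and `p_neg`, the scale `sigma_b`, the signed half-normal effect distribution, and the logistic link) are illustrative assumptions, not the paper's exact prior specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not the paper's exact specification): coefficients drawn from a
# three-component mixture -- no effect, random positive effect, or random negative effect.
n, p = 200, 1000            # far more candidate variables than responses
p_pos, p_neg = 0.01, 0.01   # small (in practice unknown) fractions with nonzero effects
sigma_b = 1.0               # scale of the random effects

# Component labels: -1 (negative effect), 0 (no effect), +1 (positive effect)
z = rng.choice([-1, 0, 1], size=p, p=[p_neg, 1 - p_pos - p_neg, p_pos])
beta = np.where(z == 0, 0.0,
                z * np.abs(rng.normal(0.0, sigma_b, size=p)))  # signed half-normal effects

X = rng.normal(size=(n, p))
eta = X @ beta                                   # linear predictor
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta)))  # e.g., a logistic GLM response

# Regardless of p, this mixture is governed by only a handful of hyperparameters
# (the mixing proportions and the effect-size scale), which is what the empirical
# Bayes / generalized alternating maximization step needs to estimate.
print("nonzero coefficients:", int(np.sum(beta != 0)), "| response mean:", y.mean())
```

Because the hyperparameters, rather than the individual coefficients, are the estimation targets, the dimension of the optimization problem stays fixed as the number of candidate variables grows, which is the source of the scalability claimed in the abstract.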