Combining a relaxed EM algorithm with Occam's razor for Bayesian variable selection in high-dimensional regression

We address the problem of Bayesian variable selection for high-dimensional linear regression. We consider a generative model with a spike-and-slab-like prior distribution, obtained by multiplying a deterministic binary vector, which encodes the sparsity pattern of the problem, with a random Gaussian parameter vector. The originality of this work is to carry out inference by relaxing the model and maximizing a type-II log-likelihood with an EM algorithm. Model selection is then performed by applying Occam's razor to a path of models produced by the EM algorithm. We report numerical comparisons between our method, called spinyReg, and state-of-the-art high-dimensional variable selection algorithms such as the lasso, the adaptive lasso, stability selection, and spike-and-slab procedures. Competitive variable selection results and predictive performance are achieved on both simulated and real benchmark data sets. We also introduce an original regression data set involving the prediction of the number of visitors to the Orsay museum in Paris from bike-sharing system data, illustrating the efficiency of the proposed approach. The R package spinyReg implementing the method proposed in this paper is available on CRAN.
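To make the generative mechanism concrete, the following minimal sketch simulates data from the model described above: the regression coefficients are the element-wise product of a deterministic binary support vector and a Gaussian weight vector. This is an illustrative reading of the abstract under stated assumptions, not the spinyReg implementation; the symbol names (z, w) and all dimension and variance values below are hypothetical.

```python
# Minimal sketch of the spike-and-slab-like generative model: the
# coefficient vector is beta = z * w, where z is a deterministic binary
# support vector and w is a random Gaussian weight vector. All sizes and
# variances are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n, p, k = 100, 50, 5           # samples, predictors, true support size
sigma_w, sigma_eps = 1.0, 0.5  # slab and noise standard deviations

z = np.zeros(p)                # deterministic binary sparsity pattern
z[rng.choice(p, size=k, replace=False)] = 1.0

w = rng.normal(0.0, sigma_w, size=p)  # Gaussian "slab" weights
beta = z * w                          # sparse regression coefficients

X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(0.0, sigma_eps, size=n)  # observed responses
```

In the actual procedure, the binary vector is relaxed to take values in [0, 1]^p, its entries are estimated by maximizing the type-II log-likelihood with an EM algorithm, and the final support is selected by applying Occam's razor along the resulting path of models.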
