Subset Selection with Shrinkage: Sparse Linear Modeling When the SNR Is Low

We study the behavior of a fundamental tool in sparse statistical modeling -- the best-subset selection procedure (aka "best-subsets"). Assuming that the underlying linear model is sparse, it is well known, both in theory and in practice, that the best-subsets procedure works extremely well in terms of several statistical metrics (prediction, estimation and variable selection) when the signal-to-noise ratio (SNR) is high. However, its performance degrades substantially when the SNR is low -- it is outperformed in predictive accuracy by continuous shrinkage methods, such as ridge regression and the Lasso. We explain why this behavior should not come as a surprise, and contend that the original version of the classical best-subsets procedure was, perhaps, not designed for use in the low-SNR regime. We propose a close cousin of best-subsets, namely, its $\ell_{q}$-regularized version, for $q \in \{1, 2\}$, which (a) mitigates, to a large extent, the poor predictive performance of best-subsets in the low-SNR regime; and (b) performs favorably, and generally delivers a substantially sparser model, when compared with the best predictive models available via ridge regression and the Lasso. Our estimator can be expressed as a solution to a mixed integer second-order conic optimization problem and, hence, is amenable to modern computational tools from mathematical optimization. We explore the theoretical properties of the predictive capabilities of the proposed estimator and complement our findings with several numerical experiments.
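
For concreteness, here is a sketch, consistent with the abstract, of what such an $\ell_{q}$-regularized subset-selection estimator can look like; the sparsity level $k$ and the shrinkage parameter $\lambda \ge 0$ are assumed tuning parameters, and the precise formulation is the one given in the paper:

$$
\hat{\beta} \;\in\; \operatorname*{argmin}_{\beta \in \mathbb{R}^{p}} \;\; \tfrac{1}{2}\,\|y - X\beta\|_{2}^{2} \;+\; \lambda\,\|\beta\|_{q}^{q} \qquad \text{subject to} \qquad \|\beta\|_{0} \le k, \qquad q \in \{1, 2\}.
$$

The cardinality constraint $\|\beta\|_{0} \le k$ can be modeled with binary indicator variables (one per coefficient), which, together with a conic representation of the quadratic terms, gives rise to the mixed integer second-order conic formulation mentioned above.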
