The Strong Screening Rule for SLOPE

Extracting relevant features from data sets in which the number of observations ($n$) is much smaller than the number of predictors ($p$) is a major challenge in modern statistics. Sorted L-One Penalized Estimation (SLOPE), a generalization of the lasso, is a promising method within this setting. Current numerical procedures for SLOPE, however, lack the efficiency that corresponding tools for the lasso enjoy, particularly in the context of estimating a complete regularization path. A key component in the efficiency of lasso solvers is predictor screening rules: rules that allow predictors to be discarded before the model is estimated. This is the first paper to establish such a rule for SLOPE. We develop a screening rule for SLOPE by examining its subdifferential and show that this rule is a generalization of the strong rule for the lasso. Our rule is heuristic, which means that it may erroneously discard predictors. We present conditions under which this may happen and show that such situations are rare and easily safeguarded against by a simple check of the optimality conditions. Our numerical experiments show that the rule performs well in practice, leading to improvements by orders of magnitude for data in the $p \gg n$ regime, while incurring no additional computational overhead when $n \gg p$. We also examine the effect of correlation structures in the design matrix on the rule and discuss algorithmic strategies for employing it. Finally, we provide an efficient implementation of the rule in our R package SLOPE.
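
To make the setting concrete, here is a brief sketch of the standard SLOPE problem and of the lasso strong rule that our rule generalizes; the notation ($X$, $y$, $\beta$, $\lambda$, $r$) is the conventional one from the literature and not necessarily that of the paper. SLOPE solves

$$
\hat{\beta} = \operatorname*{arg\,min}_{\beta \in \mathbb{R}^p} \; \tfrac{1}{2} \lVert y - X\beta \rVert_2^2 + \sum_{j=1}^{p} \lambda_j \lvert \beta \rvert_{(j)}, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0,
$$

where $\lvert \beta \rvert_{(1)} \ge \dots \ge \lvert \beta \rvert_{(p)}$ are the coefficients sorted by absolute value; taking all $\lambda_j$ equal recovers the lasso. Along a path of decreasing penalties, the strong rule for the lasso discards predictor $j$ at step $k$ whenever

$$
\lvert x_j^\top r(\lambda_{k-1}) \rvert < 2\lambda_k - \lambda_{k-1},
$$

with $r(\lambda_{k-1})$ the residual at the previous solution. The SLOPE analogue developed in this paper is derived from the subdifferential of the sorted $\ell_1$ norm, which, roughly speaking, replaces this single per-predictor comparison with conditions on cumulative sums of the sorted correlations and the penalty sequence.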
