Trimming the $\ell_1$ Regularizer: Statistical Analysis, Optimization, and Applications to Deep Learning

We study high-dimensional estimators with the trimmed $\ell_1$ penalty, which leaves the $h$ largest parameter entries penalty-free. While optimization techniques for this nonconvex penalty have been studied, its statistical properties have not yet been analyzed. We present the first statistical analyses for M-estimation and characterize support recovery, $\ell_\infty$ and $\ell_2$ error of the trimmed $\ell_1$ estimates as a function of the trimming parameter $h$. Our results show different regimes depending on how $h$ compares to the true support size. Our second contribution is a new algorithm for the trimmed regularization problem, which has the same theoretical convergence rate as difference-of-convex (DC) algorithms but in practice is faster and finds lower objective values. Empirical evaluation of $\ell_1$ trimming for sparse linear regression and graphical model estimation indicates that trimmed $\ell_1$ can outperform vanilla $\ell_1$ and nonconvex alternatives. Our last contribution is to show that the trimmed penalty is beneficial beyond M-estimation, yielding promising results for two deep learning tasks: input structure recovery and network sparsification.
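To make the penalty concrete, one standard way to write trimmed $\ell_1$ regularization (our notation here, which may differ slightly from the paper's) is to sum the absolute values of all but the $h$ largest-magnitude coordinates:
$$
R_h(\theta) \;=\; \sum_{j=h+1}^{p} |\theta|_{(j)}, \qquad |\theta|_{(1)} \ge |\theta|_{(2)} \ge \cdots \ge |\theta|_{(p)},
$$
where $|\theta|_{(j)}$ denotes the $j$-th largest entry of $\theta$ in absolute value. For $h = 0$ this reduces to the usual $\ell_1$ norm, while for $h$ at least as large as the true support size the true nonzero coordinates can, in principle, escape shrinkage entirely.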
