M-estimation with the Trimmed $\ell_1$ Penalty

We study high-dimensional estimators with the trimmed $\ell_1$ penalty, which leaves the $h$ largest parameter entries penalty-free. While optimization techniques for this nonconvex penalty have been studied, its statistical properties have not yet been analyzed. We present the first statistical analyses for $M$-estimation and characterize the support recovery, $\ell_\infty$, and $\ell_2$ error of the trimmed $\ell_1$ estimates as a function of the trimming parameter $h$. Our results reveal different regimes depending on how $h$ compares to the true support size. Our second contribution is a new algorithm for the trimmed regularization problem, which has the same theoretical convergence rate as difference-of-convex (DC) algorithms but is faster in practice and finds lower objective values. Empirical evaluation of $\ell_1$ trimming for sparse linear regression and graphical model estimation indicates that trimmed $\ell_1$ can outperform vanilla $\ell_1$ and nonconvex alternatives. Our last contribution is to show that the trimmed penalty is beneficial beyond $M$-estimation, yielding promising results for two deep learning tasks: input structure recovery and network sparsification.
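To make the penalty concrete, below is a minimal sketch of how the trimmed $\ell_1$ penalty can be computed, assuming the standard trimmed-Lasso definition in which the $h$ largest-magnitude coordinates are excluded from the sum (so $h = 0$ recovers the ordinary $\ell_1$ norm); the function name and regularization weight `lam` are illustrative, not from the paper.

```python
import numpy as np

def trimmed_l1(theta, h, lam=1.0):
    """Trimmed l1 penalty: lam times the sum of the p - h smallest |theta_i|.

    The h largest-magnitude entries are left penalty-free; with h = 0 this
    reduces to the usual l1 norm (hypothetical helper for illustration).
    """
    abs_theta = np.sort(np.abs(theta))             # magnitudes in ascending order
    keep = max(len(theta) - h, 0)                  # number of penalized entries
    return lam * abs_theta[:keep].sum()

# Example: with h = 2, the two largest entries (5.0 and -3.0) are not penalized.
theta = np.array([5.0, -3.0, 0.5, -0.2, 0.0])
print(trimmed_l1(theta, h=2))                      # 0.7
```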
