spred: Solving L1 Penalty with SGD

We propose to minimize a generic differentiable objective with an $L_1$ constraint using a simple reparametrization and straightforward stochastic gradient descent. Our proposal directly generalizes previous ideas that the $L_1$ penalty may be equivalent to a differentiable reparametrization with weight decay. We prove that the proposed method, \textit{spred}, is an exact differentiable solver of $L_1$ and that the reparametrization trick is completely ``benign'' for a generic nonconvex function. Practically, we demonstrate the usefulness of the method in (1) training sparse neural networks for gene-selection tasks, which involve finding relevant features in a very high-dimensional space, and (2) the neural network compression task, for which previous attempts to apply the $L_1$ penalty have been unsuccessful. Conceptually, our result bridges the gap between sparsity in deep learning and conventional statistical learning.

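As a concrete illustration of the reparametrization trick described above, one can replace each weight $w$ with an elementwise product $u \odot v$ and apply ordinary weight decay to $u$ and $v$: since $\min_{u \odot v = w} \tfrac{1}{2}(\|u\|^2 + \|v\|^2) = \|w\|_1$, plain SGD on the smooth reparametrized objective implicitly penalizes the $L_1$ norm of the effective weight. The PyTorch sketch below is only a minimal example under this reading of the abstract; the class and variable names (SparseLinear, lam) are ours and do not come from the authors' reference implementation.

    import torch
    import torch.nn as nn

    class SparseLinear(nn.Module):
        """Linear layer whose weight is reparametrized as an elementwise product u * v."""
        def __init__(self, in_features, out_features):
            super().__init__()
            self.u = nn.Parameter(0.1 * torch.randn(out_features, in_features))
            self.v = nn.Parameter(0.1 * torch.randn(out_features, in_features))
            self.bias = nn.Parameter(torch.zeros(out_features))

        @property
        def weight(self):
            # Effective weight; becomes sparse because weight decay on u and v
            # acts like an L1 penalty on the product u * v.
            return self.u * self.v

        def forward(self, x):
            return x @ self.weight.t() + self.bias

    # Toy lasso-style problem: only 5 of 100 features are informative.
    torch.manual_seed(0)
    X = torch.randn(512, 100)
    w_true = torch.zeros(100)
    w_true[:5] = 2.0
    y = X @ w_true + 0.01 * torch.randn(512)

    model = SparseLinear(100, 1)
    # Weight decay lam on u and v implies the penalty (lam/2)(||u||^2 + ||v||^2),
    # which equals lam * ||u*v||_1 at its minimum over factorizations of u*v.
    lam = 1e-2
    opt = torch.optim.SGD(
        [{"params": [model.u, model.v], "weight_decay": lam},
         {"params": [model.bias], "weight_decay": 0.0}],
        lr=1e-2, momentum=0.9)

    for step in range(2000):
        opt.zero_grad()
        loss = ((model(X).squeeze(-1) - y) ** 2).mean()
        loss.backward()
        opt.step()

    print("effective weights above 1e-3:", (model.weight.abs() > 1e-3).sum().item())

After training, the effective weight u * v should concentrate on the informative features; the 1e-3 threshold in the final line is an arbitrary choice for counting nonzeros, not part of the method.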