spred: Solving L1 Penalty with SGD

We propose to minimize a generic differentiable objective with an $L_1$ constraint using a simple reparametrization and straightforward stochastic gradient descent. Our proposal directly generalizes previous ideas that the $L_1$ penalty may be equivalent to a differentiable reparametrization with weight decay. We prove that the proposed method, \textit{spred}, is an exact differentiable solver of $L_1$ and that the reparametrization trick is completely ``benign'' for a generic nonconvex function. Practically, we demonstrate the usefulness of the method in (1) training sparse neural networks for gene-selection tasks, which involve finding relevant features in a very high-dimensional space, and (2) the neural network compression task, for which previous attempts to apply the $L_1$ penalty have been unsuccessful. Conceptually, our result bridges the gap between sparsity in deep learning and conventional statistical learning.

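As a concrete illustration of the reparametrization trick described above, one can replace each weight $w$ with an elementwise product $u \odot v$ and apply ordinary weight decay to $u$ and $v$: since $\min_{u \odot v = w} \tfrac{1}{2}(\|u\|^2 + \|v\|^2) = \|w\|_1$, plain SGD on the smooth reparametrized objective implicitly penalizes the $L_1$ norm of the effective weight. The PyTorch sketch below is only a minimal example under this reading of the abstract; the class and variable names (SparseLinear, lam) are ours and do not come from the authors' reference implementation.

    import torch
    import torch.nn as nn

    class SparseLinear(nn.Module):
        """Linear layer whose weight is reparametrized as an elementwise product u * v."""
        def __init__(self, in_features, out_features):
            super().__init__()
            self.u = nn.Parameter(0.1 * torch.randn(out_features, in_features))
            self.v = nn.Parameter(0.1 * torch.randn(out_features, in_features))
            self.bias = nn.Parameter(torch.zeros(out_features))

        @property
        def weight(self):
            # Effective weight; becomes sparse because weight decay on u and v
            # acts like an L1 penalty on the product u * v.
            return self.u * self.v

        def forward(self, x):
            return x @ self.weight.t() + self.bias

    # Toy lasso-style problem: only 5 of 100 features are informative.
    torch.manual_seed(0)
    X = torch.randn(512, 100)
    w_true = torch.zeros(100)
    w_true[:5] = 2.0
    y = X @ w_true + 0.01 * torch.randn(512)

    model = SparseLinear(100, 1)
    # Weight decay lam on u and v implies the penalty (lam/2)(||u||^2 + ||v||^2),
    # which equals lam * ||u*v||_1 at its minimum over factorizations of u*v.
    lam = 1e-2
    opt = torch.optim.SGD(
        [{"params": [model.u, model.v], "weight_decay": lam},
         {"params": [model.bias], "weight_decay": 0.0}],
        lr=1e-2, momentum=0.9)

    for step in range(2000):
        opt.zero_grad()
        loss = ((model(X).squeeze(-1) - y) ** 2).mean()
        loss.backward()
        opt.step()

    print("effective weights above 1e-3:", (model.weight.abs() > 1e-3).sum().item())

After training, the effective weight u * v should concentrate on the informative features; the 1e-3 threshold in the final line is an arbitrary choice for counting nonzeros, not part of the method.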