Implicit differentiation of Lasso-type models for hyperparameter optimization

Setting regularization parameters for Lasso-type estimators is notoriously difficult, though crucial in practice. The most popular hyperparameter optimization approach is grid search on held-out validation data. Grid search, however, requires choosing a predefined grid for each parameter, and its cost scales exponentially with the number of parameters. An alternative is to cast hyperparameter optimization as a bi-level optimization problem that can be solved by gradient descent. The key challenge for such methods is estimating the gradient of the validation criterion with respect to the hyperparameters. Computing this gradient via forward or backward automatic differentiation is possible but usually suffers from high memory consumption. Implicit differentiation, on the other hand, typically involves solving a linear system, which can be prohibitive and numerically unstable in high dimension; moreover, it usually assumes a smooth loss function, which is not the case for Lasso-type problems. This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored to Lasso-type problems. The approach scales to high-dimensional data by leveraging the sparsity of the solutions. Experiments demonstrate that the proposed method outperforms a large number of standard approaches at optimizing the held-out error or the Stein Unbiased Risk Estimator (SURE).
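
To make the implicit-differentiation idea concrete, here is a minimal sketch, not the paper's exact algorithm: restricted to the active set S of the Lasso solution, the first-order optimality conditions yield a closed-form Jacobian of the coefficients with respect to the regularization parameter, so the hypergradient of a held-out loss only requires a small |S| x |S| linear system rather than one over all features. The function and variable names (lasso_hypergradient, X_train, alpha, ...) are illustrative assumptions; only NumPy and scikit-learn are used.

```python
# Sketch: hypergradient of a validation MSE w.r.t. the Lasso parameter alpha,
# via implicit differentiation on the active set (assumes X_S^T X_S is invertible).
import numpy as np
from sklearn.linear_model import Lasso

def lasso_hypergradient(X_train, y_train, X_val, y_val, alpha):
    n_train, n_val = X_train.shape[0], X_val.shape[0]
    # Inner problem (scikit-learn scaling): min_b 1/(2n) ||y - Xb||^2 + alpha ||b||_1
    beta = Lasso(alpha=alpha, fit_intercept=False).fit(X_train, y_train).coef_
    support = np.flatnonzero(beta)            # active set S of the Lasso solution
    if support.size == 0:
        return 0.0                            # solution is identically zero
    X_S = X_train[:, support]
    sign_S = np.sign(beta[support])
    # On S: X_S^T (X_S beta_S - y)/n + alpha * sign_S = 0, hence
    # d beta_S / d alpha = -n (X_S^T X_S)^{-1} sign_S.
    # Sparsity keeps |S| small, so this system stays cheap to solve.
    jac_S = -n_train * np.linalg.solve(X_S.T @ X_S, sign_S)
    # Outer criterion: 1/(2 n_val) ||X_val beta - y_val||^2
    resid_val = X_val[:, support] @ beta[support] - y_val
    grad_outer_S = X_val[:, support].T @ resid_val / n_val
    # Chain rule: dL_val/d alpha = (d beta_S/d alpha)^T grad_{beta_S} L_val
    return jac_S @ grad_outer_S
```

Once such a hypergradient is available, bi-level hyperparameter optimization proceeds by feeding it to any first-order routine, typically gradient descent on log(alpha).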
