Gradient-based Regularization Parameter Selection for Problems With Nonsmooth Penalty Functions

ABSTRACT In high-dimensional and/or nonparametric regression problems, regularization (or penalization) is used to control model complexity and to induce desired structure. Each penalty has a weight parameter that indicates how strongly the corresponding structure should be enforced. Typically, these parameters are chosen to minimize error on a separate validation set, using a simple grid search or a gradient-free optimization method. Tuning is more efficient when the gradient of the validation loss with respect to the parameters is available, but computing it is often difficult for problems with nonsmooth penalty functions. Here we show that for many penalized regression problems, the validation loss is in fact smooth almost everywhere with respect to the penalty parameters, so a modified gradient descent algorithm can be applied to tune them. Through simulation studies on example regression problems, we find that increasing the number of penalty parameters and tuning them with our method can decrease the generalization error.
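To make the idea concrete, below is a minimal sketch (our own illustration, not code from the paper) for the simplest case: a single lasso penalty tuned by gradient descent on a held-out validation loss. It uses the almost-everywhere smoothness noted in the abstract: away from the measure-zero set of parameter values where the lasso active set changes, the solution restricted to its active set solves a smooth linear system, which can be differentiated implicitly. The data-generating setup, step size, and iteration count are illustrative assumptions, and scikit-learn's Lasso is used for the inner solve.

```python
# Sketch: gradient descent on the validation loss of a lasso fit.
# Assumptions (not from the paper): synthetic Gaussian data, plain
# gradient descent on log(alpha), sklearn Lasso as the inner solver.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_train, n_val, p = 100, 100, 50
beta_true = np.zeros(p)
beta_true[:5] = 2.0
X_tr = rng.standard_normal((n_train, p))
y_tr = X_tr @ beta_true + rng.standard_normal(n_train)
X_va = rng.standard_normal((n_val, p))
y_va = X_va @ beta_true + rng.standard_normal(n_val)

def val_loss_and_grad(log_alpha):
    """Validation loss and its gradient w.r.t. log(alpha).

    sklearn's Lasso minimizes (1/(2n))||y - X b||^2 + alpha ||b||_1.
    On the active set A with signs s, stationarity gives
        (1/n) X_A^T (X_A b_A - y) + alpha * s = 0,
    so where A is locally constant, d b_A / d alpha = -n (X_A^T X_A)^{-1} s.
    """
    alpha = np.exp(log_alpha)
    fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=50_000).fit(X_tr, y_tr)
    beta = fit.coef_
    A = np.flatnonzero(beta != 0.0)
    resid_va = y_va - X_va @ beta
    loss = 0.5 * np.mean(resid_va ** 2)
    if A.size == 0:
        # Solution is identically zero (alpha above its critical value),
        # so the validation loss is locally constant in alpha.
        return loss, 0.0
    XA = X_tr[:, A]
    s = np.sign(beta[A])
    dbeta_dalpha = -n_train * np.linalg.solve(XA.T @ XA, s)
    dloss_dalpha = -(resid_va @ X_va[:, A]) @ dbeta_dalpha / n_val
    return loss, alpha * dloss_dalpha  # chain rule: d/dlog(alpha) = alpha * d/dalpha

log_alpha, step = np.log(0.5), 0.5
for it in range(30):
    loss, grad = val_loss_and_grad(log_alpha)
    log_alpha -= step * grad
print(f"tuned alpha = {np.exp(log_alpha):.4f}, val loss = {loss:.4f}")
```

Stepping in log(alpha) keeps the penalty parameter positive and makes the step size scale-free. With several penalties, the same implicit-differentiation step applies coordinate-wise, which is the multi-parameter setting the abstract targets; note the sketch uses plain gradient descent, whereas the paper's modified variant also handles the nondifferentiable points.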
