Gradient-Based Optimization of Hyperparameters

Many machine learning algorithms can be formulated as the minimization of a training criterion that involves a hyperparameter. This hyperparameter is usually chosen by trial and error with a model selection criterion. In this article we present a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyperparameters can be efficiently computed by backpropagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyperparameter gradient that involves second derivatives of the training criterion.
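
The following is a minimal sketch, not code from the paper, of the implicit-function-theorem formula the abstract refers to, applied to ridge regression with a single hyperparameter lambda. The toy data, the variable names, and the choice of validation squared error as the model selection criterion are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)   # training set
X_va, y_va = rng.normal(size=(30, 5)), rng.normal(size=30)   # validation set
lam = 0.1                                                     # hyperparameter

# Training criterion C(w, lam) = ||X_tr w - y_tr||^2 + lam ||w||^2.
# Its minimizer w*(lam) solves the normal equations (X'X + lam I) w = X'y.
A = X_tr.T @ X_tr + lam * np.eye(5)          # d^2C/dw^2, up to a factor of 2
w = np.linalg.solve(A, X_tr.T @ y_tr)

# Model selection criterion: squared error on the validation set, E(w).
dE_dw = 2.0 * X_va.T @ (X_va @ w - y_va)

# Implicit function theorem: dw*/dlam = -(d^2C/dw^2)^{-1} (d^2C/(dw dlam)),
# with d^2C/(dw dlam) = 2w for the ridge penalty; the factors of 2 cancel.
dw_dlam = -np.linalg.solve(A, w)

# Chain rule: dE/dlam = (dE/dw)^T (dw*/dlam)
dE_dlam = dE_dw @ dw_dlam
print("hyperparameter gradient dE/dlam =", dE_dlam)
```

The same gradient can then drive an ordinary descent step on lambda, which is the sense in which the hyperparameter is optimized rather than tuned by trial and error.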
