Gradient-Based Optimization of Hyperparameters

Many machine learning algorithms can be formulated as the minimization of a training criterion that involves a hyperparameter. This hyperparameter is usually chosen by trial and error with a model selection criterion. In this article we present a methodology to optimize several hyperparameters, based on the computation of the gradient of a model selection criterion with respect to the hyperparameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyperparameters can be efficiently computed by backpropagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyperparameter gradient that involves second derivatives of the training criterion.
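
The following is a minimal sketch, not code from the paper, of the implicit-function-theorem formula the abstract refers to, applied to ridge regression with a single hyperparameter lambda. The toy data, the variable names, and the choice of validation squared error as the model selection criterion are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr, y_tr = rng.normal(size=(50, 5)), rng.normal(size=50)   # training set
X_va, y_va = rng.normal(size=(30, 5)), rng.normal(size=30)   # validation set
lam = 0.1                                                     # hyperparameter

# Training criterion C(w, lam) = ||X_tr w - y_tr||^2 + lam ||w||^2.
# Its minimizer w*(lam) solves the normal equations (X'X + lam I) w = X'y.
A = X_tr.T @ X_tr + lam * np.eye(5)          # d^2C/dw^2, up to a factor of 2
w = np.linalg.solve(A, X_tr.T @ y_tr)

# Model selection criterion: squared error on the validation set, E(w).
dE_dw = 2.0 * X_va.T @ (X_va @ w - y_va)

# Implicit function theorem: dw*/dlam = -(d^2C/dw^2)^{-1} (d^2C/(dw dlam)),
# with d^2C/(dw dlam) = 2w for the ridge penalty; the factors of 2 cancel.
dw_dlam = -np.linalg.solve(A, w)

# Chain rule: dE/dlam = (dE/dw)^T (dw*/dlam)
dE_dlam = dE_dw @ dw_dlam
print("hyperparameter gradient dE/dlam =", dE_dlam)
```

The same gradient can then drive an ordinary descent step on lambda, which is the sense in which the hyperparameter is optimized rather than tuned by trial and error.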
