Paper information - Continuous optimization of hyper-parameters

Continuous optimization of hyper-parameters

Many machine learning algorithms can be formulated as the minimization of a training criterion which involves a hyper-parameter. This hyper-parameter is usually chosen by trial and error with a model selection criterion. In this paper we present a methodology to optimize several hyper-parameters, based on the computation of the gradient of a model selection criterion with respect to the hyper-parameters. In the case of a quadratic training criterion, the gradient of the selection criterion with respect to the hyper-parameters is efficiently computed by back-propagating through a Cholesky decomposition. In the more general case, we show that the implicit function theorem can be used to derive a formula for the hyper-parameter gradient involving second derivatives of the training criterion.
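The general-case formula mentioned in the abstract can be illustrated on a small quadratic example. The following is a minimal sketch, not the paper's implementation: it assumes ridge regression with a single hyper-parameter `lam`, a training criterion C(w, lam) = 0.5*||Xw - y||^2 + 0.5*lam*||w||^2, and a held-out squared error E(w) as the model selection criterion. At the training minimum the implicit function theorem gives dw/dlam = -H^{-1} (d^2 C / dw dlam) = -H^{-1} w, where H is the Hessian of C with respect to w, so dE/dlam = (dE/dw)^T dw/dlam. All function names (`fit_ridge`, `hypergradient`, `make_data`, `finite_difference`) and the synthetic data are hypothetical, and the linear systems are solved directly rather than by back-propagating through a Cholesky decomposition.

```python
import numpy as np


def fit_ridge(X, y, lam):
    """Minimize 0.5*||X w - y||^2 + 0.5*lam*||w||^2 in closed form."""
    # Hessian of the training criterion with respect to w.
    H = X.T @ X + lam * np.eye(X.shape[1])
    w = np.linalg.solve(H, X.T @ y)
    return w, H


def val_loss(w, Xv, yv):
    """Model selection criterion: squared error on held-out data."""
    r = Xv @ w - yv
    return 0.5 * (r @ r)


def hypergradient(X, y, Xv, yv, lam):
    """Gradient of the validation loss with respect to lam."""
    w, H = fit_ridge(X, y, lam)
    grad_w = Xv.T @ (Xv @ w - yv)        # dE/dw at the trained weights
    # Implicit function theorem: the mixed second derivative
    # d^2 C / (dw dlam) equals w, so dw/dlam = -H^{-1} w.
    dw_dlam = -np.linalg.solve(H, w)
    return grad_w @ dw_dlam


def make_data(seed=0, n_train=40, n_val=20, d=5):
    """Synthetic linear-regression data (assumed, for illustration only)."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n_train, d))
    Xv = rng.normal(size=(n_val, d))
    y = X @ w_true + 0.5 * rng.normal(size=n_train)
    yv = Xv @ w_true + 0.5 * rng.normal(size=n_val)
    return X, y, Xv, yv


def finite_difference(X, y, Xv, yv, lam, eps=1e-5):
    """Central-difference estimate of the same gradient, used as a check."""
    up = val_loss(fit_ridge(X, y, lam + eps)[0], Xv, yv)
    down = val_loss(fit_ridge(X, y, lam - eps)[0], Xv, yv)
    return (up - down) / (2 * eps)
```

Comparing `hypergradient(X, y, Xv, yv, lam)` with `finite_difference(X, y, Xv, yv, lam)` on the output of `make_data()` is a quick sanity check. With several hyper-parameters the same linear solve would be reused, with one mixed-derivative column per hyper-parameter.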

Yoshua Bengio
