Gradient-based Hyperparameter Optimization through Reversible Learning

Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.
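Below is a minimal sketch of the idea of chaining derivatives backwards through an entire training run, written with the autograd package and ordinary reverse-mode unrolling. The linear least-squares model, the data shapes, and the particular hyperparameter parameterization (log learning rate, momentum logit) are illustrative assumptions, not the paper's setup. Note that this naive unrolled version stores the whole weight trajectory in memory; the paper's contribution is to avoid that cost by exactly reversing the dynamics of SGD with momentum.

```python
# Hedged sketch: hypergradients by differentiating through an unrolled
# SGD-with-momentum training loop (autograd, https://github.com/HIPS/autograd).
# Model, data, and hyperparameter names are illustrative assumptions.
import autograd.numpy as np
from autograd import grad

def train_loss(w, x, y):
    # Simple least-squares loss for a linear model (illustrative only).
    return np.mean((np.dot(x, w) - y) ** 2)

def train_then_validate(hypers, w0, x_tr, y_tr, x_val, y_val, num_steps=100):
    """Run SGD with momentum, then return the validation loss.

    hypers[0] = log learning rate, hypers[1] = logit of the momentum decay.
    Every operation is differentiable, so grad() can chain derivatives
    backwards through the entire training procedure.
    """
    lr = np.exp(hypers[0])
    mom = 1.0 / (1.0 + np.exp(-hypers[1]))
    w, v = w0, np.zeros_like(w0)
    for _ in range(num_steps):
        g = grad(train_loss)(w, x_tr, y_tr)   # training gradient w.r.t. weights
        v = mom * v - (1.0 - mom) * g          # momentum update
        w = w + lr * v                         # weight update
    return train_loss(w, x_val, y_val)

# Hypergradient: d(validation loss) / d(hyperparameters).
hypergrad = grad(train_then_validate)

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    x_tr, y_tr = rng.randn(80, 5), rng.randn(80)
    x_val, y_val = rng.randn(20, 5), rng.randn(20)
    w0 = 0.1 * rng.randn(5)
    hypers = np.array([np.log(0.05), 2.0])  # initial log lr and momentum logit
    # Gradient descent on the hyperparameters themselves.
    for _ in range(10):
        hypers = hypers - 0.1 * hypergrad(hypers, w0, x_tr, y_tr, x_val, y_val)
```

The same pattern extends, at least in principle, to the richer hyperparameters the abstract mentions (per-iteration step-size and momentum schedules, initialization scales, regularization parameters), since they simply become additional differentiable inputs to the unrolled training function.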
