Gradient-based Hyperparameter Optimization through Reversible Learning

Tuning hyperparameters of learning algorithms is hard because gradients are usually unavailable. We compute exact gradients of cross-validation performance with respect to all hyperparameters by chaining derivatives backwards through the entire training procedure. These gradients allow us to optimize thousands of hyperparameters, including step-size and momentum schedules, weight initialization distributions, richly parameterized regularization schemes, and neural network architectures. We compute hyperparameter gradients by exactly reversing the dynamics of stochastic gradient descent with momentum.
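Below is a minimal sketch of the idea of chaining derivatives backwards through an entire training run, written with the autograd package and ordinary reverse-mode unrolling. The linear least-squares model, the data shapes, and the particular hyperparameter parameterization (log learning rate, momentum logit) are illustrative assumptions, not the paper's setup. Note that this naive unrolled version stores the whole weight trajectory in memory; the paper's contribution is to avoid that cost by exactly reversing the dynamics of SGD with momentum.

```python
# Hedged sketch: hypergradients by differentiating through an unrolled
# SGD-with-momentum training loop (autograd, https://github.com/HIPS/autograd).
# Model, data, and hyperparameter names are illustrative assumptions.
import autograd.numpy as np
from autograd import grad

def train_loss(w, x, y):
    # Simple least-squares loss for a linear model (illustrative only).
    return np.mean((np.dot(x, w) - y) ** 2)

def train_then_validate(hypers, w0, x_tr, y_tr, x_val, y_val, num_steps=100):
    """Run SGD with momentum, then return the validation loss.

    hypers[0] = log learning rate, hypers[1] = logit of the momentum decay.
    Every operation is differentiable, so grad() can chain derivatives
    backwards through the entire training procedure.
    """
    lr = np.exp(hypers[0])
    mom = 1.0 / (1.0 + np.exp(-hypers[1]))
    w, v = w0, np.zeros_like(w0)
    for _ in range(num_steps):
        g = grad(train_loss)(w, x_tr, y_tr)   # training gradient w.r.t. weights
        v = mom * v - (1.0 - mom) * g          # momentum update
        w = w + lr * v                         # weight update
    return train_loss(w, x_val, y_val)

# Hypergradient: d(validation loss) / d(hyperparameters).
hypergrad = grad(train_then_validate)

if __name__ == "__main__":
    rng = np.random.RandomState(0)
    x_tr, y_tr = rng.randn(80, 5), rng.randn(80)
    x_val, y_val = rng.randn(20, 5), rng.randn(20)
    w0 = 0.1 * rng.randn(5)
    hypers = np.array([np.log(0.05), 2.0])  # initial log lr and momentum logit
    # Gradient descent on the hyperparameters themselves.
    for _ in range(10):
        hypers = hypers - 0.1 * hypergrad(hypers, w0, x_tr, y_tr, x_val, y_val)
```

The same pattern extends, at least in principle, to the richer hyperparameters the abstract mentions (per-iteration step-size and momentum schedules, initialization scales, regularization parameters), since they simply become additional differentiable inputs to the unrolled training function.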
