Scalable One-Pass Optimisation of High-Dimensional Weight-Update Hyperparameters by Implicit Differentiation

Machine learning training methods depend on hyperparameters in numerous and intricate ways, motivating automated strategies for their optimisation. Many existing algorithms restart training for each new hyperparameter choice, at considerable computational cost. Some hypergradient-based one-pass methods exist, but these either cannot be applied to arbitrary optimiser hyperparameters (such as learning rates and momenta) or take several times longer to train than their base models. We extend these existing methods to develop an approximate hypergradient-based hyperparameter optimiser which is applicable to any continuous hyperparameter appearing in a differentiable model weight update, yet requires only one training episode, with no restarts. We also provide a motivating argument for convergence to the true hypergradient, and perform tractable gradient-based optimisation of independent learning rates for each model parameter. Our method performs competitively from varied random hyperparameter initialisations on several UCI datasets and Fashion-MNIST (using a one-layer MLP), Penn Treebank (using an LSTM) and CIFAR-10 (using a ResNet-18), taking only 2–3x longer than vanilla training.
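
To make the implicit-differentiation machinery concrete, below is a minimal PyTorch sketch, not the authors' algorithm, of a one-pass loop that interleaves ordinary training steps with occasional hypergradient updates computed via the implicit function theorem, using a truncated Neumann series to approximate the inverse Hessian-vector product. For simplicity it tunes a single training-loss hyperparameter (a log weight-decay coefficient) on a toy regression problem rather than the weight-update hyperparameters (such as per-parameter learning rates) the paper targets; the model, data, truncation depth and update interval are illustrative assumptions.

```python
# Minimal sketch, assuming a toy regression task: one-pass training with
# periodic hyperparameter updates from an approximate IFT hypergradient.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_batch(n=64):
    x = torch.randn(n, 10)
    y = x @ torch.arange(1.0, 11.0).unsqueeze(1) + 0.1 * torch.randn(n, 1)
    return x, y

model = nn.Linear(10, 1)
params = list(model.parameters())

# Hyperparameter to tune online: log weight-decay coefficient.
log_wd = torch.tensor(-2.0, requires_grad=True)
hyper_opt = torch.optim.Adam([log_wd], lr=1e-2)
inner_opt = torch.optim.SGD(params, lr=1e-2)

def train_loss():
    x, y = make_batch()
    mse = ((model(x) - y) ** 2).mean()
    return mse + log_wd.exp() * sum((p ** 2).sum() for p in params)

def val_loss():
    x, y = make_batch()
    return ((model(x) - y) ** 2).mean()

def neumann_inverse_hvp(v, depth=5, alpha=1e-2):
    # Approximate H^{-1} v, with H = d^2 L_train / dw^2, via the truncated
    # Neumann series H^{-1} ~= alpha * sum_j (I - alpha * H)^j.
    g = torch.autograd.grad(train_loss(), params, create_graph=True)
    p_vec = [vi.clone() for vi in v]
    acc = [vi.clone() for vi in v]
    for _ in range(depth):
        hvp = torch.autograd.grad(g, params, grad_outputs=p_vec, retain_graph=True)
        p_vec = [pi - alpha * hi for pi, hi in zip(p_vec, hvp)]
        acc = [ai + pi for ai, pi in zip(acc, p_vec)]
    return [alpha * ai for ai in acc]

for step in range(500):
    # Ordinary training step -- the single pass, with no restarts.
    inner_opt.zero_grad()
    train_loss().backward()
    inner_opt.step()

    # Occasionally update the hyperparameter from the IFT hypergradient
    # dL_val/d(lambda) = -(dL_val/dw)^T H^{-1} d^2 L_train/(dw d(lambda)).
    if step > 0 and step % 20 == 0:
        v = torch.autograd.grad(val_loss(), params)      # dL_val/dw
        ihvp = neumann_inverse_hvp(v)                    # ~= H^{-1} dL_val/dw
        g = torch.autograd.grad(train_loss(), params, create_graph=True)
        gp = sum((gi * pi).sum() for gi, pi in zip(g, ihvp))
        mixed = torch.autograd.grad(gp, log_wd)[0]       # mixed second-derivative term
        hyper_opt.zero_grad()
        log_wd.grad = -mixed                             # approximate hypergradient
        hyper_opt.step()
```

In the paper's setting the hyperparameters enter the weight update itself, so the implicit relation is taken through the update rule rather than the training loss, but the broad idea of approximate hypergradients obtained by implicit differentiation within a single training run carries over.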
