Cost-Efficient Online Hyperparameter Optimization

Recent work on hyperparameter optimization (HPO) has shown that certain hyperparameters can be trained jointly with a model's regular parameters. However, these online HPO algorithms still require evaluating on a set of validation examples at every training step, which substantially increases the cost of training. To decide when to query the validation loss, we model online HPO as a time-varying Bayesian optimization problem, on top of which we introduce a novel costly-feedback setting that captures the cost of each query. Under this setting, standard algorithms are cost-inefficient because they evaluate on the validation set at every round. In contrast, the cost-efficient GP-UCB algorithm proposed in this paper queries the unknown function only when the model is insufficiently confident in its current decision. We evaluate our algorithm by tuning hyperparameters online for VGG and ResNet on CIFAR-10 and ImageNet100. Our online HPO algorithm reaches human-expert-level performance within a single training run, while incurring only modest computational overhead compared to regular training.
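
To make the confidence-gated query rule concrete, the sketch below runs a GP-UCB loop over a fixed hyperparameter grid and pays for a validation query only when the posterior is still uncertain at the chosen point. This is a minimal illustration under stated assumptions, not the paper's actual method: the quadratic `validation_loss`, the threshold `eps`, and the exploration weight `beta` are hypothetical, and the time-varying component of the model is omitted for brevity.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Hypothetical stand-in for the expensive validation loss; in online HPO
# this would be a full evaluation pass over the validation set.
def validation_loss(x):
    return (x - 0.3) ** 2 + 0.01 * rng.standard_normal()

candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)  # hyperparameter grid
gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.1), alpha=1e-4)
X_obs, y_obs = [], []
beta, eps = 2.0, 0.05  # exploration weight and confidence threshold (illustrative)

for t in range(60):
    if X_obs:
        mu, sigma = gp.predict(candidates, return_std=True)
    else:
        # Before any observations, fall back to the GP prior.
        mu = np.zeros(len(candidates))
        sigma = np.ones(len(candidates))
    # We minimize loss, so pick the point with the smallest lower confidence bound.
    i = int(np.argmin(mu - beta * sigma))
    # Cost-efficient rule: query the expensive validation loss only when the
    # posterior is still uncertain about the chosen point; otherwise skip.
    if sigma[i] > eps:
        X_obs.append(candidates[i])
        y_obs.append(float(validation_loss(candidates[i, 0])))
        gp.fit(np.array(X_obs), np.array(y_obs))

best = candidates[int(np.argmin(gp.predict(candidates)))]
print("selected hyperparameter:", best[0])
```

The `sigma[i] > eps` gate is what distinguishes this loop from standard GP-UCB, which would evaluate on the validation set at every round: once the posterior is confident near the chosen hyperparameter, further queries are skipped and the model's prediction is reused.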
