Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning

Bayesian optimization is popular for optimizing time-consuming black-box objectives. Nonetheless, for hyperparameter tuning in deep neural networks, the time required to evaluate the validation error for even a few hyperparameter settings remains a bottleneck. Multi-fidelity optimization promises relief by using cheaper proxies for such objectives --- for example, the validation error of a network trained on a subset of the training points or for fewer iterations than required for convergence. We propose a highly flexible and practical approach to multi-fidelity Bayesian optimization, focused on efficiently optimizing hyperparameters for iteratively trained supervised learning models. We introduce a new acquisition function, the trace-aware knowledge-gradient, which efficiently leverages both multiple continuous fidelity controls and trace observations --- values of the objective at a sequence of fidelities, available when fidelity is varied via the number of training iterations. We provide a provably convergent method for optimizing our acquisition function and show that it outperforms state-of-the-art alternatives for hyperparameter tuning of deep neural networks and large-scale kernel learning.
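
To make the abstract's central idea concrete, the following is a minimal sketch of the knowledge-gradient family of acquisition functions in generic notation; the symbols used here (the posterior mean mu_n, the fidelity vector z, and the cost model cost(x, z)) are our own shorthand rather than the paper's definitions, and the trace-aware variant introduced in the paper generalizes this picture. After n evaluations, the standard knowledge gradient values sampling a hyperparameter setting x by the expected gain in the best posterior-mean value:

\[
\mathrm{KG}_n(x) \;=\; \mathbb{E}_n\!\left[\, \max_{x'} \mu_{n+1}(x') \;-\; \max_{x'} \mu_n(x') \;\middle|\; \text{the } (n{+}1)\text{-st evaluation is at } x \,\right],
\]

where mu_n denotes the Gaussian-process posterior mean of the full-fidelity objective. A cost-sensitive multi-fidelity extension additionally chooses a fidelity vector z (for example, training-set size and number of training iterations) and maximizes the expected gain per unit cost,

\[
(x_{n+1},\, z_{n+1}) \;\in\; \operatorname*{arg\,max}_{x,\, z} \; \frac{\mathrm{KG}_n(x, z)}{\mathrm{cost}(x, z)},
\]

with KG_n(x, z) defined as above but conditioning on an observation of the objective at fidelity z. The trace-aware knowledge-gradient described in the abstract refines this further: because training for a given number of iterations also reveals the validation error at every earlier iteration, a single run yields an entire trace of correlated observations, and the acquisition function accounts for the value of all of them at once.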
