Faster & More Reliable Tuning of Neural Networks: Bayesian Optimization with Importance Sampling

Many contemporary machine learning models require extensive hyperparameter tuning to perform well. A variety of methods, such as Bayesian optimization, have been developed to automate and expedite this process. However, tuning remains costly because it typically requires repeatedly training models to completion. To address this issue, Bayesian optimization has been extended to use cheap, partially trained models to extrapolate to expensive, fully trained ones. This approach enlarges the set of hyperparameters explored, but including many low-fidelity observations adds to the intrinsic randomness of the procedure and makes extrapolation challenging. We propose to accelerate the tuning of neural networks in a robust way by taking into account the relative amount of information contributed by each training example. To do so, we leverage importance sampling (IS); this significantly increases the quality of the function evaluations, but also their runtime, and so must be done carefully. Casting hyperparameter search as a multi-task Bayesian optimization problem over both the hyperparameters and the IS design achieves the best of both worlds. By learning a parameterization of IS that trades off evaluation complexity and quality, our method improves validation error, in both the average and the worst case, while using higher-fidelity observations obtained with less data. We show that this results in more reliable performance of our method in less wall-clock time across a variety of datasets and neural architectures.
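To make the importance-sampling ingredient concrete, the sketch below illustrates, under assumptions of our own (a least-squares model, a loss-proportional sampling distribution mixed with the uniform one, and NumPy as the implementation language; none of these are taken from the paper), how a non-uniformly sampled minibatch can be reweighted by 1/(N p_i) so that the stochastic gradient remains an unbiased estimate of the full-batch gradient.

# Minimal sketch of importance-sampled SGD (not the paper's implementation).
# Examples are drawn with probability p_i and reweighted by 1/(N * p_i),
# which keeps the minibatch gradient unbiased for the full-batch gradient.
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(d)
batch_size, lr = 32, 0.05

for step in range(200):
    residuals = X @ w - y            # shape (N,)
    losses = residuals ** 2
    # Loss-proportional sampling mixed with the uniform distribution to keep
    # the IS weights bounded (a common variance-control heuristic; this fixed
    # choice stands in for the parameterized IS design learned in the paper).
    p = 0.5 * losses / losses.sum() + 0.5 / N
    p /= p.sum()

    # Non-uniform minibatch, drawn with replacement for simplicity.
    idx = rng.choice(N, size=batch_size, p=p)

    # Reweighting by 1/(N * p_i) makes this an unbiased estimate of the
    # gradient of the mean squared error over all N examples.
    weights = 1.0 / (N * p[idx])
    grad = (weights[:, None] * (2.0 * residuals[idx])[:, None] * X[idx]).mean(axis=0)
    w -= lr * grad

print("parameter error:", np.linalg.norm(w - w_true))

In the paper's setting, the IS design itself is not fixed as above: it is parameterized and treated, together with the hyperparameters, as an input of a multi-task Bayesian optimization procedure that trades off evaluation complexity against evaluation quality.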
