Faster & More Reliable Tuning of Neural Networks: Bayesian Optimization with Importance Sampling
Setareh Ariafar | Zelda Mariet | Ehsan Elhamifar | Dana Brooks | Jennifer Dy | Jasper Snoek
[1] Deanna Needell, et al. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm, 2013, Mathematical Programming.
[2] Donald R. Jones, et al. Efficient Global Optimization of Expensive Black-Box Functions, 1998, J. Glob. Optim..
[3] Jasper Snoek, et al. Practical Bayesian Optimization of Machine Learning Algorithms, 2012, NIPS.
[4] Aaron Klein, et al. Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets, 2016, AISTATS.
[5] D. Dennis, et al. A statistical method for global optimization, 1992, [Proceedings] 1992 IEEE International Conference on Systems, Man, and Cybernetics.
[6] Yann LeCun, et al. The MNIST database of handwritten digits, 2005.
[7] Jasper Snoek, et al. Freeze-Thaw Bayesian Optimization, 2014, ArXiv.
[8] D. Dennis, et al. SDO: A Statistical Method for Global Optimization, 1997.
[9] Zhihua Zhang, et al. CPSG-MCMC: Clustering-Based Preprocessing method for Stochastic Gradient MCMC, 2017, AISTATS.
[10] Hedvig Kjellström, et al. Determinantal Point Processes for Mini-Batch Diversification, 2017, UAI.
[11] Tong Zhang, et al. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization, 2014, ICML.
[12] Jonas Mockus, et al. On Bayesian Methods for Seeking the Extremum, 1974, Optimization Techniques.
[13] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.
[14] Edwin V. Bonilla, et al. Multi-task Gaussian Process Prediction, 2007, NIPS.
[15] Andrew McCallum, et al. Energy and Policy Considerations for Deep Learning in NLP, 2019, ACL.
[16] Matthias Poloczek, et al. Bayesian Optimization with Gradients, 2017, NIPS.
[17] Frank Hutter, et al. Speeding Up Automatic Hyperparameter Optimization of Deep Neural Networks by Extrapolation of Learning Curves, 2015, IJCAI.
[18] Jasper Snoek, et al. Multi-Task Bayesian Optimization, 2013, NIPS.
[19] Philipp Hennig, et al. Entropy Search for Information-Efficient Global Optimization, 2011, J. Mach. Learn. Res..
[20] Nikos Komodakis, et al. Wide Residual Networks, 2016, BMVC.
[21] Kian Hsiang Low, et al. Bayesian Optimization Meets Bayesian Optimal Stopping, 2019, ICML.
[22] Carl E. Rasmussen, et al. Gaussian processes for machine learning, 2005, Adaptive Computation and Machine Learning.
[23] Alexander I. J. Forrester, et al. Multi-fidelity optimization via surrogate modelling, 2007, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences.
[24] Tyler B. Johnson, et al. Training Deep Models Faster with Robust, Approximate Importance Sampling, 2018, NeurIPS.
[25] Mark W. Schmidt, et al. Non-Uniform Stochastic Average Gradient Method for Training Conditional Random Fields, 2015, AISTATS.
[26] R. A. Miller, et al. Sequential kriging optimization using multiple-fidelity evaluations, 2006.
[27] Yoshua Bengio, et al. Algorithms for Hyper-Parameter Optimization, 2011, NIPS.
[28] D. Sculley, et al. Google Vizier: A Service for Black-Box Optimization, 2017, KDD.
[29] François Fleuret, et al. Not All Samples Are Created Equal: Deep Learning with Importance Sampling, 2018, ICML.
[30] Jason Xu, et al. Combination of Hyperband and Bayesian Optimization for Hyperparameter Optimization in Deep Learning, 2018, ArXiv.
[31] Isabelle Bloch, et al. Hyperparameter optimization of deep neural networks: combining Hyperband with Bayesian model selection, 2017.
[32] Marius Lindauer, et al. Towards Assessing the Impact of Bayesian Optimization's Own Hyperparameters, 2019, ArXiv.
[33] Aaron Klein, et al. BOHB: Robust and Efficient Hyperparameter Optimization at Scale, 2018, ICML.
[34] Jakob Bossek, et al. Initial design strategies and their effects on sequential model-based optimization: an exploratory case study based on BBOB, 2020, GECCO.
[35] Matthew W. Hoffman, et al. Predictive Entropy Search for Efficient Global Optimization of Black-box Functions, 2014, NIPS.
[36] Ameet Talwalkar, et al. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization, 2016, J. Mach. Learn. Res..
[37] Kirthevasan Kandasamy, et al. Gaussian Process Bandit Optimisation with Multi-fidelity Evaluations, 2016, NIPS.
[38] Frank Hutter, et al. Online Batch Selection for Faster Training of Neural Networks, 2015, ArXiv.
[39] Daniel Hernández-Lobato, et al. Predictive Entropy Search for Multi-objective Bayesian Optimization with Constraints, 2016, Neurocomputing.