GSdyn: Learning training dynamics via online Gaussian optimization with gradient states

Bayesian optimization, whose efficiency for automatic hyperparameter tuning has been verified over the past decade, still faces a long-standing dilemma between massive time consumption and suboptimal search results. Although much effort has been devoted to accelerating and improving the optimizer, the evaluation step, which dominates the time cost, has received comparatively little attention. In this paper, we propose a novel online Bayesian algorithm that optimizes hyperparameters while learning the training dynamics, freeing it from repeated complete evaluations. To address the non-stationarity problem, i.e., that the same hyperparameters lead to different results at different training steps, we combine the training loss and the dominant eigenvalue to track the training dynamics. Compared to traditional algorithms, our method saves time and exploits important intermediate information that classical Bayesian methods, which focus only on final results, do not leverage well. Experiments on CIFAR-10 and CIFAR-100 verify the efficacy of our approach.
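The abstract does not include code, so the following is only a rough illustrative sketch of the kind of "gradient state" it describes: the current training loss paired with an estimate of the dominant eigenvalue, taken here to mean the largest eigenvalue of the loss Hessian, computed by power iteration on Hessian-vector products (the Pearlmutter trick). The function name dominant_eigenvalue and the use of PyTorch are assumptions, not the authors' implementation.

```python
import torch


def dominant_eigenvalue(loss, params, n_iters=20, tol=1e-4):
    """Estimate the largest-magnitude Hessian eigenvalue of `loss` w.r.t. `params`
    using power iteration on Hessian-vector products (hypothetical helper)."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Start from a random unit vector over all parameter tensors.
    v = [torch.randn_like(p) for p in params]
    norm = torch.sqrt(sum((x * x).sum() for x in v))
    v = [x / norm for x in v]
    eig = 0.0
    for _ in range(n_iters):
        # Hessian-vector product: differentiate (grad . v) w.r.t. the parameters.
        gv = sum((g * x).sum() for g, x in zip(grads, v))
        hv = torch.autograd.grad(gv, params, retain_graph=True)
        # Rayleigh quotient v^T H v gives the current eigenvalue estimate.
        new_eig = sum((x * h).sum() for x, h in zip(v, hv)).item()
        hv_norm = torch.sqrt(sum((h * h).sum() for h in hv))
        v = [h / (hv_norm + 1e-12) for h in hv]
        if abs(new_eig - eig) < tol * (abs(eig) + 1e-12):
            return new_eig
        eig = new_eig
    return eig


# Hypothetical usage: record the training state of a small model at one step.
model = torch.nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
state = (loss.item(), dominant_eigenvalue(loss, list(model.parameters())))
```

In the full method, such (loss, eigenvalue) observations, together with the candidate hyperparameters, would presumably serve as inputs to the online Gaussian process surrogate, letting the optimizer score hyperparameters from intermediate training states rather than only from completed runs.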
