Population Based Training of Neural Networks

Neural networks dominate the modern machine learning landscape, but their training and success still suffer from sensitivity to empirical choices of hyperparameters such as model architecture, loss function, and optimisation algorithm. In this work we present Population Based Training (PBT), a simple asynchronous optimisation algorithm which effectively utilises a fixed computational budget to jointly optimise a population of models and their hyperparameters to maximise performance. Importantly, PBT discovers a schedule of hyperparameter settings rather than following the generally sub-optimal strategy of trying to find a single fixed set to use for the whole course of training. With just a small modification to a typical distributed hyperparameter training framework, our method allows robust and reliable training of models. We demonstrate the effectiveness of PBT on deep reinforcement learning problems, showing faster wall-clock convergence and higher final performance of agents by optimising over a suite of hyperparameters. In addition, we show the same method can be applied to supervised learning for machine translation, where PBT is used to maximise the BLEU score directly, and also to training of Generative Adversarial Networks to maximise the Inception score of generated images. In all cases PBT results in the automatic discovery of hyperparameter schedules and model selection, which leads to stable training and better final performance.
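To make the exploit-and-explore idea concrete, the following is a minimal, self-contained sketch of a PBT-style loop on a toy quadratic objective. The `Worker` class, the bottom/top-20% truncation rule, and the 0.8/1.2 perturbation factors are illustrative assumptions rather than the exact configuration used in the paper, and the workers are stepped sequentially here purely to keep the sketch runnable; in the method described above they would train asynchronously against a shared population.

```python
import copy
import random


class Worker:
    """Toy population member: parameters theta trained by gradient ascent on a
    hyperparameter-weighted surrogate objective, evaluated on the true objective
    Q(theta) = 1.2 - theta[0]^2 - theta[1]^2 (an illustrative toy problem)."""

    def __init__(self):
        self.theta = [0.9, 0.9]                                    # model weights
        self.h = [random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)]  # hyperparameters
        self.score = self.evaluate()

    def train_step(self, lr=0.01):
        # One gradient-ascent step on the surrogate
        # Q_hat(theta | h) = 1.2 - h[0]*theta[0]^2 - h[1]*theta[1]^2.
        self.theta = [t - lr * 2.0 * h * t for t, h in zip(self.theta, self.h)]

    def evaluate(self):
        # True objective the population is trying to maximise.
        self.score = 1.2 - self.theta[0] ** 2 - self.theta[1] ** 2
        return self.score


def exploit_and_explore(worker, population, perturb=(0.8, 1.2)):
    # Truncation selection (an assumed rule): if this worker is in the bottom 20%
    # of the population, copy weights and hyperparameters from a top-20% member
    # (exploit), then multiply each hyperparameter by a random factor (explore).
    ranked = sorted(population, key=lambda w: w.score)
    cutoff = max(1, len(ranked) // 5)
    if worker in ranked[:cutoff]:
        better = random.choice(ranked[-cutoff:])
        worker.theta = copy.deepcopy(better.theta)
        worker.h = [h * random.choice(perturb) for h in better.h]


def pbt(pop_size=10, steps=500, ready_every=50):
    population = [Worker() for _ in range(pop_size)]
    for step in range(1, steps + 1):
        for w in population:
            w.train_step()
            if step % ready_every == 0:      # worker is "ready": evaluate, then exploit/explore
                w.evaluate()
                exploit_and_explore(w, population)
    return max(population, key=lambda w: w.score)


if __name__ == "__main__":
    best = pbt()
    print("best true objective:", round(best.score, 4), "hyperparameters:", best.h)
```

Because each worker periodically inherits the weights and (perturbed) hyperparameters of better-performing members, the hyperparameter values effectively change over the course of training, which is how the discovered schedules mentioned above arise.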
