Neural Optimizer Search with Reinforcement Learning

We present an approach to automate the process of discovering optimization methods, with a focus on deep learning architectures. We train a Recurrent Neural Network controller to generate a string in a domain-specific language that describes a mathematical update equation built from a list of primitive functions, such as the gradient and the running average of the gradient. The controller is trained with Reinforcement Learning to maximize the performance of a model after a few epochs. On CIFAR-10, our method discovers several update rules that outperform many commonly used optimizers, such as Adam, RMSProp, and SGD with and without momentum, on a ConvNet model. We introduce two new optimizers, named PowerSign and AddSign, which we show transfer well and improve training on a variety of different tasks and architectures, including ImageNet classification and Google's neural machine translation system.
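
As an illustration of the kind of update rules the search produces, the following is a minimal NumPy sketch of PowerSign and AddSign as they are commonly written; the constants used here (base e for PowerSign, alpha = 1 for AddSign), the momentum decay beta, and the function names are illustrative assumptions rather than details stated in the abstract. Both rules compare the sign of the current gradient with the sign of its running average, enlarging the step when the two agree and shrinking it when they disagree.

import numpy as np

def powersign_update(w, g, m, lr=0.01, base=np.e, beta=0.9):
    """One PowerSign step (sketch): scale the gradient by base ** (sign(g) * sign(m))."""
    m = beta * m + (1.0 - beta) * g                 # running average of the gradient
    w = w - lr * base ** (np.sign(g) * np.sign(m)) * g
    return w, m

def addsign_update(w, g, m, lr=0.01, alpha=1.0, beta=0.9):
    """One AddSign step (sketch): scale the gradient by (alpha + sign(g) * sign(m))."""
    m = beta * m + (1.0 - beta) * g                 # running average of the gradient
    w = w - lr * (alpha + np.sign(g) * np.sign(m)) * g
    return w, m

With these illustrative defaults, sign agreement multiplies the step by roughly 2.7 for PowerSign and exactly 2 for AddSign, while disagreement shrinks it to about 0.37 or to 0, respectively.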
