HyperAdam: A Learnable Task-Adaptive Adam for Network Training

Deep neural networks are traditionally trained with human-designed stochastic optimization algorithms, such as SGD and Adam. Recently, learning to optimize network parameters has emerged as a promising research direction. However, these learned black-box optimizers sometimes fail to fully exploit the experience embodied in human-designed optimizers and therefore have limited generalization ability. In this paper, a new optimizer, dubbed \textit{HyperAdam}, is proposed that combines the idea of "learning to optimize" with the traditional Adam optimizer. Given a network to be trained, the parameter update that HyperAdam generates at each iteration is an adaptive combination of multiple updates produced by Adam with varying decay rates. The combination weights and decay rates are learned adaptively, depending on the task. HyperAdam is modeled as a recurrent neural network consisting of an AdamCell, a WeightCell and a StateCell. It is shown to achieve state-of-the-art performance in training various networks, such as multilayer perceptrons, CNNs and LSTMs.
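To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of an update rule that adaptively combines several Adam-style updates computed with different second-moment decay rates. The class name `AdamEnsembleSketch`, the candidate decay rates, and the fixed combination weights passed to `step` are illustrative assumptions; in HyperAdam the weights and decay rates are produced by learned recurrent cells (AdamCell, WeightCell, StateCell) conditioned on the training task, and for simplicity only the second-moment decay rate varies across candidates here.

```python
# Sketch of a task-agnostic "combination of Adam updates with varying decay rates".
# In HyperAdam the weights and decay rates would be predicted by a learned RNN;
# here they are plain placeholders to illustrate the combination mechanism.
import numpy as np

class AdamEnsembleSketch:
    def __init__(self, dim, betas=(0.5, 0.9, 0.99), lr=1e-3, eps=1e-8):
        self.betas = betas                        # candidate second-moment decay rates (assumed values)
        self.lr = lr
        self.eps = eps
        self.m = np.zeros(dim)                    # shared first-moment estimate (fixed beta1 = 0.9)
        self.v = [np.zeros(dim) for _ in betas]   # one second-moment estimate per candidate decay rate
        self.t = 0

    def step(self, theta, grad, weights):
        """Apply one update; `weights` are the combination weights over candidates."""
        self.t += 1
        self.m = 0.9 * self.m + 0.1 * grad
        m_hat = self.m / (1 - 0.9 ** self.t)      # bias-corrected first moment
        update = np.zeros_like(theta)
        for w, beta, v in zip(weights, self.betas, self.v):
            v[:] = beta * v + (1 - beta) * grad ** 2
            v_hat = v / (1 - beta ** self.t)      # bias-corrected second moment for this decay rate
            update += w * m_hat / (np.sqrt(v_hat) + self.eps)
        return theta - self.lr * update

# Example usage with arbitrary weights (learned per step in the actual method):
opt = AdamEnsembleSketch(dim=2)
theta = np.array([1.0, -2.0])
grad = np.array([0.3, -0.1])
theta = opt.step(theta, grad, weights=np.array([0.2, 0.3, 0.5]))
```

The design intuition is that maintaining several second-moment estimates with different decay rates, and mixing the resulting candidate updates, lets an adaptive controller tune the effective smoothing of the gradient statistics to the task at hand.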
