A Generalizable Approach to Learning Optimizers

A core issue with learning to optimize neural networks has been the lack of generalization to real-world problems. To address this, we describe a system designed from a generalization-first perspective: it learns to update optimizer hyperparameters rather than model parameters directly, using novel features, actions, and a reward function. This system outperforms Adam on all neural network tasks, including modalities not seen during training. We achieve 2x speedups on ImageNet and a 2.5x speedup on a language modeling task using over 5 orders of magnitude more compute than the training tasks.

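To make the control loop concrete, here is a minimal sketch of one way such a system could wrap a standard training run: a controller observes per-step features, emits an action that rescales Adam's learning rate, and receives a reward for reducing the loss. The feature set, action space, reward, and the `policy` stub below are illustrative assumptions for this sketch, not the features, actions, and reward function defined by the paper.

```python
# Minimal sketch (not the paper's implementation) of hyperparameter control:
# a placeholder "policy" observes per-step training features and emits a
# multiplicative action on Adam's learning rate; the reward is log-loss
# improvement. Features, actions, reward, and the policy stub are all
# illustrative assumptions.
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy inner task: fit a small MLP to random regression data.
x, y = torch.randn(512, 16), torch.randn(512, 1)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def features(loss, prev_loss, grad_norm, lr):
    """Per-step observation for the controller (stand-in feature set)."""
    return torch.tensor([
        math.log(loss + 1e-8),
        math.log(prev_loss + 1e-8) - math.log(loss + 1e-8),
        math.log(grad_norm + 1e-8),
        math.log(lr),
    ])

def policy(obs):
    """Placeholder learned controller: pick a learning-rate multiplier.

    A real controller would be a trained network mapping obs to a
    distribution over actions such as {0.5, 1.0, 2.0}; here it always
    returns 1.0 so the sketch runs without a trained policy.
    """
    return 1.0

with torch.no_grad():
    prev_loss = loss_fn(model(x), y).item()

for step in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    grad_norm = torch.norm(
        torch.stack([p.grad.norm() for p in model.parameters()])
    ).item()
    opt.step()

    # Outer loop: observe features, act on the hyperparameter, collect reward.
    lr = opt.param_groups[0]["lr"]
    obs = features(loss.item(), prev_loss, grad_norm, lr)
    opt.param_groups[0]["lr"] = lr * policy(obs)

    # Reward = log-loss improvement; in a full system this signal would be
    # used to train the policy with reinforcement learning.
    reward = math.log(prev_loss + 1e-8) - math.log(loss.item() + 1e-8)
    prev_loss = loss.item()
```

A multiplicative action keeps the learning rate positive and lets the controller reason on a log scale; in this sketch the policy is a constant stub, whereas a learned controller would be trained against the reward signal.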