Extreme Tensoring for Low-Memory Preconditioning

State-of-the-art models are now trained with billions of parameters, pushing against the memory limits of modern hardware. This has created demand for memory-efficient optimizers. To this end, we investigate the limits and performance tradeoffs of memory-efficient, adaptively preconditioned gradient methods. We propose extreme tensoring for high-dimensional stochastic optimization, showing that an optimizer needs very little memory to benefit from adaptive preconditioning. Our technique applies to arbitrary models (not necessarily those with tensor-shaped parameters) and is accompanied by regret and convergence guarantees, which shed light on the tradeoffs between preconditioner quality and expressivity. On a large-scale NLP model, we reduce the optimizer's memory overhead by three orders of magnitude without degrading performance.
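
To make the abstract's idea concrete, below is a minimal NumPy sketch of the general scheme it describes: the flattened parameter is reshaped into a higher-order tensor, and one small AdaGrad-style accumulator is kept per tensor mode, so the preconditioner costs the sum of the mode sizes rather than their product. This is an illustrative sketch under stated assumptions, not the paper's exact algorithm; the class name ExtremeTensoredAdagrad, the reshape shape, the -1/(2k) exponent, and the epsilon are assumptions introduced here for illustration.

```python
import numpy as np

def reshape_to_tensor(flat_grad, shape):
    """Reshape a flat gradient into a higher-order tensor, zero-padding if needed."""
    size = int(np.prod(shape))
    padded = np.zeros(size)
    padded[:flat_grad.size] = flat_grad
    return padded.reshape(shape)

class ExtremeTensoredAdagrad:
    """Sketch: AdaGrad-style preconditioning with one small accumulator per
    tensor mode instead of a full per-parameter second-moment vector."""

    def __init__(self, shape, lr=0.1, eps=1e-8):
        self.shape = shape          # tensor shape the parameter is reshaped into
        self.k = len(shape)
        self.lr = lr
        self.eps = eps
        # One accumulator vector per mode: memory is sum(shape), not prod(shape).
        self.acc = [np.zeros(d) for d in shape]

    def step(self, param, grad):
        g = reshape_to_tensor(grad, self.shape)
        g2 = g ** 2
        # Accumulate squared gradients, summed over all other modes.
        for i in range(self.k):
            axes = tuple(a for a in range(self.k) if a != i)
            self.acc[i] += g2.sum(axis=axes)
        # Elementwise preconditioner: product of mode-wise factors, each raised
        # to -1/(2k) (a Shampoo-style choice; exponent is an assumption here).
        precond = np.ones(self.shape)
        for i in range(self.k):
            factor = (self.acc[i] + self.eps) ** (-1.0 / (2 * self.k))
            shape_i = [1] * self.k
            shape_i[i] = self.shape[i]
            precond = precond * factor.reshape(shape_i)
        update = (precond * g).reshape(-1)[:param.size]
        return param - self.lr * update

if __name__ == "__main__":
    # Toy usage: minimize ||param||^2 for a 1000-dimensional parameter
    # reshaped into a 10 x 10 x 10 tensor.
    rng = np.random.default_rng(0)
    param = rng.normal(size=1000)
    opt = ExtremeTensoredAdagrad(shape=(10, 10, 10), lr=0.1)
    for _ in range(5):
        grad = 2 * param            # gradient of ||param||^2
        param = opt.step(param, grad)
```

In this toy setting the optimizer stores only 30 accumulator entries for a 1000-dimensional parameter; the same sum-versus-product saving, applied at much larger scale, is the source of the memory reduction the abstract reports.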
