Paper Information - Shampoo: Preconditioned Stochastic Tensor Optimization

Shampoo: Preconditioned Stochastic Tensor Optimization

Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Although it involves a more complex update rule, Shampoo's runtime per step is comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.
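The storage argument in the abstract can be made concrete with a small count. The following is a minimal sketch, not the paper's implementation. It assumes a dense full preconditioner holds one entry per pair of parameters, and that each per-dimension matrix is square with side equal to that dimension's size. The function names and the example shape are hypothetical and chosen only for illustration.

```python
from math import prod


def full_preconditioner_entries(shape):
    """Entries in one dense preconditioner over the flattened tensor.

    Assumption: a d-parameter tensor needs a d x d matrix.
    """
    d = prod(shape)
    return d * d


def per_dimension_preconditioner_entries(shape):
    """Entries in one square n_i x n_i matrix per tensor dimension."""
    return sum(n * n for n in shape)


# Hypothetical weight-matrix shape, used only for illustration.
shape = (1024, 512)
full = full_preconditioner_entries(shape)
per_dim = per_dimension_preconditioner_entries(shape)
print(f"full: {full} entries, per-dimension: {per_dim} entries")
```

Under these assumptions the per-dimension count grows with the sum of squared dimension sizes rather than the square of their product, which is the gap the abstract refers to when it calls full preconditioning matrices prohibitively large.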

[1] Naman Agarwal, et al. Second Order Stochastic Optimization in Linear Time, 2016, ArXiv.

[2] Naman Agarwal, et al. Second-Order Stochastic Optimization for Machine Learning in Linear Time, 2016, J. Mach. Learn. Res.

[3] Shai Shalev-Shwartz, et al. Online Learning and Online Convex Optimization, 2012, Found. Trends Mach. Learn.

[4] Claudio Gentile, et al. On the generalization ability of on-line learning algorithms, 2001, IEEE Transactions on Information Theory.

[5] Chi-Kwong Li. Geometric Means, 2003.

[6] Charles R. Johnson, et al. Topics in Matrix Analysis, 1991.

[7] Yuan Yu, et al. TensorFlow: A system for large-scale machine learning, 2016, OSDI.

[8] Elad Hazan, et al. Introduction to Online Convex Optimization, 2016, Found. Trends Optim.

[9] Ruslan Salakhutdinov, et al. Path-SGD: Path-Normalized Optimization in Deep Neural Networks, 2015, NIPS.

[10] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.

[11] Andrea Montanari, et al. Convergence rates of sub-sampled Newton methods, 2015, NIPS.

[12] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[13] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[14] Karl Löwner. Über monotone Matrixfunktionen, 1934.

[15] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[16] Peng Xu, et al. Sub-sampled Newton Methods with Non-uniform Sampling, 2016, NIPS.

[17] Santosh S. Vempala, et al. Efficient algorithms for online decision problems, 2005, J. Comput. Syst. Sci.

[18] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[19] Yoram Singer, et al. A Unified Approach to Adaptive Regularization in Online and Stochastic Optimization, 2017, ArXiv.

[20] Martin J. Wainwright, et al. Newton Sketch: A Near Linear-Time Optimization Algorithm with Linear-Quadratic Convergence, 2015, SIAM J. Optim.

[21] R. Fletcher. Practical Methods of Optimization, 1988.

[22] J. Nocedal. Updating Quasi-Newton Matrices With Limited Storage, 1980.

[23] Thorsten Brants, et al. One billion word benchmark for measuring progress in statistical language modeling, 2013, INTERSPEECH.

[24] Adrian S. Lewis, et al. Nonsmooth optimization via quasi-Newton methods, 2012, Mathematical Programming.

[25] Shai Shalev-Shwartz, et al. Faster SGD Using Sketched Conditioning, 2015, ArXiv.