Efficient Full-Matrix Adaptive Regularization
Yi Zhang | Karan Singh | Naman Agarwal | Elad Hazan | Cyril Zhang | Brian Bullins | Xinyi Chen
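For context on the title, here is a minimal sketch of the classical full-matrix AdaGrad preconditioner (Duchi et al., reference [21]) that work on efficient full-matrix adaptive regularization seeks to approximate cheaply. This is an illustrative toy implementation under my own assumptions, not the method proposed in the paper; all function and variable names are hypothetical.

import numpy as np

def full_matrix_adagrad_step(w, grad, G_accum, lr=0.1, eps=1e-8):
    # Illustrative sketch of full-matrix AdaGrad (not the paper's algorithm):
    # accumulate the sum of gradient outer products, then precondition the
    # gradient by the inverse square root of that matrix. This costs O(d^2)
    # memory and O(d^3) time per step, which is the bottleneck that
    # efficient full-matrix methods aim to avoid.
    G_accum = G_accum + np.outer(grad, grad)              # G_t = G_{t-1} + g_t g_t^T
    vals, vecs = np.linalg.eigh(G_accum)                  # G_accum is symmetric PSD
    inv_sqrt = vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, 0.0) + eps)) @ vecs.T
    w = w - lr * inv_sqrt @ grad                          # w_{t+1} = w_t - eta * G_t^{-1/2} g_t
    return w, G_accum

# Toy usage on a least-squares objective f(w) = 0.5 * ||A w - b||^2.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)
w, G = np.zeros(5), np.zeros((5, 5))
for _ in range(200):
    g = A.T @ (A @ w - b)
    w, G = full_matrix_adagrad_step(w, g, G)
print(0.5 * np.linalg.norm(A @ w - b) ** 2)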
[1] Tengyu Ma, et al. Finding approximate local minima faster than gradient descent, 2016, STOC.
[2] Jimmy Ba, et al. Kronecker-factored Curvature Approximations for Recurrent Neural Networks, 2018, ICLR.
[3] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[4] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.
[5] Ann Bies, et al. The Penn Treebank: Annotating Predicate Argument Structure, 1994, HLT.
[6] Richard Socher, et al. An Analysis of Neural Language Modeling at Multiple Scales, 2018, ArXiv.
[7] Zheng Zhang, et al. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems, 2015, ArXiv.
[8] Francesco Orabona, et al. On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes, 2018, AISTATS.
[9] Martín Abadi, et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.
[10] Elad Hazan, et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.
[11] Sanjiv Kumar, et al. On the Convergence of Adam and Beyond, 2018, ICLR.
[12] Mingyi Hong, et al. On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization, 2018, ICLR.
[13] Jascha Sohl-Dickstein, et al. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Understanding and Improvement, 2017, ArXiv.
[14] Haipeng Luo, et al. Efficient Second Order Online Learning by Sketching, 2016, NIPS.
[15] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, ArXiv.
[16] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[17] Sanjiv Kumar, et al. Escaping Saddle Points with Adaptive Gradient Methods, 2019, ICML.
[18] Richard Socher, et al. Regularizing and Optimizing LSTM Language Models, 2017, ICLR.
[19] Yoram Singer, et al. Shampoo: Preconditioned Stochastic Tensor Optimization, 2018, ICML.
[20] Andrea Montanari, et al. Convergence rates of sub-sampled Newton methods, 2015, NIPS.
[21] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[22] Frank Hutter, et al. SGDR: Stochastic Gradient Descent with Warm Restarts, 2016, ICLR.
[23] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.
[24] Razvan Pascanu, et al. On the difficulty of training recurrent neural networks, 2012, ICML.
[25] Saeed Ghadimi, et al. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming, 2013, SIAM J. Optim.
[26] Frank Hutter, et al. SGDR: Stochastic Gradient Descent with Restarts, 2016, ArXiv.
[27] Naman Agarwal, et al. Second-Order Stochastic Optimization for Machine Learning in Linear Time, 2016, J. Mach. Learn. Res.
[28] Enhong Chen, et al. Universal Stagewise Learning for Non-Convex Problems with Convergence on Averaged Solutions, 2018, ICLR.
[29] Richard Socher, et al. Improving Generalization Performance by Switching from Adam to SGD, 2017, ArXiv.
[30] Li Shen, et al. On the Convergence of AdaGrad with Momentum for Training Deep Neural Networks, 2018, ArXiv.
[31] Yuan Cao, et al. On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization, 2018, ArXiv.
[32] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.
[33] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.
[34] Yoshua Bengio, et al. Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations, 2016, ICLR.
[35] Xavier Gastaldi, et al. Shake-Shake regularization, 2017, ArXiv.
[36] Xiaoxia Wu, et al. AdaGrad stepsizes: sharp convergence over nonconvex landscapes, from any initialization, 2019, ArXiv.
[37] Timothy Dozat, et al. Incorporating Nesterov Momentum into Adam, 2016.
[38] Jorge Nocedal, et al. On the limited memory BFGS method for large scale optimization, 1989, Math. Program.
[39] Yi Zhang, et al. Stronger generalization bounds for deep nets via a compression approach, 2018, ICML.
[40] Hao Li, et al. Visualizing the Loss Landscape of Neural Nets, 2017, NeurIPS.
[41] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.
[42] Sébastien Bubeck, et al. Convex Optimization: Algorithms and Complexity, 2014, Found. Trends Mach. Learn.
[43] Alistair P. Rendell, et al. CompAdaGrad: A Compressed, Complementary, Computationally-Efficient Adaptive Gradient Method, 2016, ArXiv.
[44] Yair Carmon, et al. "Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions, 2017, ICML.
[45] Zeyuan Allen-Zhu, et al. Katyusha: the first direct acceleration of stochastic gradient methods, 2016, J. Mach. Learn. Res.
[46] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.
[47] Chris Dyer, et al. On the State of the Art of Evaluation in Neural Language Models, 2017, ICLR.
[48] Joachim M. Buhmann, et al. Scalable Adaptive Stochastic Optimization Using Random Projections, 2016, NIPS.
[49] Zeyuan Allen-Zhu, et al. Variance Reduction for Faster Non-Convex Optimization, 2016, ICML.