Efficient Full-Matrix Adaptive Regularization

Adaptive regularization methods pre-multiply a descent direction by a preconditioning matrix. Because machine learning models have a very large number of parameters, full-matrix preconditioning is prohibitively expensive. We show how to modify full-matrix adaptive regularization to make it practical and effective. We also provide a novel theoretical analysis of adaptive regularization in non-convex optimization settings. The core of our algorithm, termed GGT, is the efficient computation of the inverse square root of a low-rank matrix. Our preliminary experiments show improved iteration-wise convergence rates across synthetic tasks and standard deep learning benchmarks, and indicate that the more carefully preconditioned steps sometimes lead to a better solution.
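
The low-rank computation at the heart of the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the preconditioner has the form (G Gᵀ + εI)^{-1/2}, where the columns of G are a small window of r recent gradients with r ≪ d, and the function name, damping term ε, and windowing details are illustrative assumptions. The point is that the d × d inverse square root is never formed explicitly; only the r × r Gram matrix GᵀG is decomposed.

```python
import numpy as np

def inverse_sqrt_precondition(G, g, eps=1e-4):
    """Apply (G G^T + eps*I)^{-1/2} to a gradient g without forming
    the d x d matrix G G^T.

    G   : d x r matrix whose columns are the r most recent gradients (r << d)
    g   : current gradient, shape (d,)
    eps : damping term (an assumed hyperparameter of this sketch)
    """
    d, r = G.shape

    # Eigendecompose the small r x r Gram matrix. If G = U S V^T is the
    # thin SVD, then G^T G = V S^2 V^T, so the nonzero eigenvalues of
    # G G^T are the entries of S^2.
    M = G.T @ G                       # r x r, costs O(d r^2)
    sigma2, V = np.linalg.eigh(M)     # eigenvalues in ascending order
    sigma2 = np.clip(sigma2, 0.0, None)
    sigma = np.sqrt(sigma2)

    # Recover the left singular vectors U = G V diag(1/sigma) for
    # the numerically nonzero singular values.
    nz = sigma > 1e-12
    U = G @ (V[:, nz] / sigma[nz])    # d x r', orthonormal columns

    # (G G^T + eps I)^{-1/2} g
    #   = U diag(1/sqrt(sigma^2 + eps)) U^T g  +  (g - U U^T g) / sqrt(eps)
    coeffs = U.T @ g                  # coordinates of g in span(U)
    in_span = U @ (coeffs / np.sqrt(sigma2[nz] + eps))
    residual = (g - U @ coeffs) / np.sqrt(eps)
    return in_span + residual
```

Under these assumptions the per-step cost is O(dr² + r³) for the Gram matrix and its eigendecomposition, rather than the O(d³) a naive matrix inverse square root would require.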
