Sketchy: Memory-efficient Adaptive Regularization with Frequent Directions

Adaptive regularization methods that exploit more than the diagonal entries exhibit state-of-the-art performance for many tasks, but can be prohibitive in terms of memory and running time. We find that the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace that changes throughout training, motivating a low-rank sketching approach. We describe a generic method for reducing the memory and compute requirements of maintaining a matrix preconditioner using the Frequent Directions (FD) sketch. Our technique allows interpolation between resource requirements and the degradation in regret guarantees with rank $k$: in the online convex optimization (OCO) setting over dimension $d$, we match full-matrix $d^2$-memory regret using only $dk$ memory, up to additive error in the bottom $d-k$ eigenvalues of the gradient covariance. Further, we extend our method to Shampoo, placing the resulting algorithm on the memory-quality Pareto frontier of several large-scale benchmarks.
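
To make the mechanism concrete, the following is a minimal NumPy sketch of the idea the abstract describes: maintaining a Frequent Directions approximation of the gradient covariance in $O(dk)$ memory and applying the corresponding inverse-square-root preconditioner. The function names, the per-step compression variant, and the damping constant `eps` are illustrative assumptions; Sketchy itself applies FD to Shampoo's Kronecker factors rather than to a single full-dimensional covariance as shown here.

```python
# A minimal, self-contained NumPy illustration (not the paper's implementation):
# maintain a Frequent Directions (FD) sketch of the gradient covariance
# sum_t g_t g_t^T using only O(d * ell) memory, and use it to apply an
# approximate inverse-square-root preconditioner.
import numpy as np


def fd_update(B, g):
    """One FD step on sketch B (shape (ell, d)) with a new gradient g (shape (d,)).

    Stacks g as an extra row, shrinks all squared singular values by the
    smallest one, and drops the (now exactly zero) extra row, so B^T B stays a
    deterministic low-rank under-approximation of the accumulated covariance.
    """
    ell, _ = B.shape
    stacked = np.vstack([B, g[None, :]])                  # (ell + 1, d)
    _, s, Vt = np.linalg.svd(stacked, full_matrices=False)
    delta = s[-1] ** 2                                    # smallest squared singular value
    s_shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    return (s_shrunk[:, None] * Vt)[:ell]                 # back to (ell, d)


def precondition(B, g, eps=1e-4):
    """Compute (B^T B + eps * I)^{-1/2} @ g without forming any d x d matrix."""
    _, s, Vt = np.linalg.svd(B, full_matrices=False)      # B^T B = Vt.T diag(s^2) Vt
    coeff = 1.0 / np.sqrt(s ** 2 + eps) - 1.0 / np.sqrt(eps)
    # Identity part acts on all of g; the low-rank correction acts only on the
    # component of g inside the sketched eigenspace.
    return g / np.sqrt(eps) + Vt.T @ (coeff * (Vt @ g))


# Toy usage: stream random gradients in dimension d = 1000 with a rank-32 sketch.
d, ell = 1000, 32
B = np.zeros((ell, d))
for _ in range(100):
    g = np.random.randn(d)
    B = fd_update(B, g)
    update_direction = precondition(B, g)
```

For intuition, the standard FD guarantee states that with sketch size $\ell$, the accumulated covariance is under-approximated by at most $\|A - A_k\|_F^2 / (\ell - k)$ in spectral norm for any $k < \ell$; this tail term is the source of the additive error in the bottom $d-k$ eigenvalues referred to in the abstract.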
