Eva: A General Vectorized Approximation Framework for Second-order Optimization

Second-order optimization algorithms exhibit excellent convergence properties for training deep learning models, but they often incur significant computation and memory overheads, which can make training less efficient than with first-order counterparts such as stochastic gradient descent (SGD). In this work, we present Eva, a memory- and time-efficient second-order algorithm with two novel techniques: 1) we construct the second-order information from the Kronecker factorization of small stochastic vectors averaged over a mini-batch of training data, which reduces memory consumption, and 2) we derive an efficient update formula via the Sherman-Morrison formula, so that no matrix inverse is ever computed explicitly. We further extend Eva to a general vectorized approximation framework that improves the compute and memory efficiency of two existing second-order algorithms (FOOF and Shampoo) without affecting their convergence performance. Extensive experiments on different models and datasets show that Eva reduces end-to-end training time by up to 2.05x compared to first-order SGD and by up to 2.42x compared to second-order algorithms (K-FAC and Shampoo).
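To make the two ideas concrete, below is a minimal NumPy sketch of how a layer's weight gradient could be preconditioned with rank-one-plus-damping Kronecker factors built from mini-batch-averaged vectors, using the Sherman-Morrison identity so that no matrix is ever formed or inverted. The helper names (`sherman_morrison_solve`, `eva_style_update`), the damping value, and the exact averaging scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sherman_morrison_solve(v, X, damping):
    """Apply (damping * I + v v^T)^{-1} to the 2-D array X without forming
    the matrix, using the Sherman-Morrison identity:
        (damping * I + v v^T)^{-1} X = (X - v (v^T X) / (damping + v^T v)) / damping
    """
    coeff = (v @ X) / (damping + v @ v)          # v^T X scaled by the rank-one correction
    return (X - np.outer(v, coeff)) / damping

def eva_style_update(grad_W, a_bar, g_bar, damping=0.03):
    """Illustrative preconditioning of a layer gradient grad_W (d_out x d_in)
    with Kronecker factors built from the mini-batch-averaged input vector
    a_bar (d_in,) and output-gradient vector g_bar (d_out,). Only vector
    products and rank-one corrections are needed, so the extra memory is
    O(d_in + d_out) per layer rather than O(d_in^2 + d_out^2)."""
    # Left factor: (g_bar g_bar^T + damping * I)^{-1} applied to grad_W.
    left = sherman_morrison_solve(g_bar, grad_W, damping)
    # Right factor: (a_bar a_bar^T + damping * I)^{-1} applied on the input side.
    return sherman_morrison_solve(a_bar, left.T, damping).T
```

In a full optimizer one would typically capture `a_bar` and `g_bar` with forward/backward hooks and smooth them with a running average before plugging them into the update; the sketch above only illustrates the vectorized, inversion-free preconditioning step.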

[1] Percy Liang, et al. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training, 2023, arXiv.

[2] S. Shi, et al. Accelerating Distributed K-FAC with Efficient Collective Communication and Scheduling, 2023, IEEE INFOCOM 2023 - IEEE Conference on Computer Communications.

[3] S. Shi, et al. Scalable K-FAC Training for Deep Neural Networks With Distributed Preconditioning, 2022, IEEE Transactions on Cloud Computing.

[4] Jeff Z. HaoChen, et al. Amortized Proximal Optimization, 2022, NeurIPS.

[5] Frederik Benzing. Gradient Descent on Neurons and its Link to Approximate Second-Order Optimization, 2022, ICML.

[6] Dan Alistarh, et al. M-FAC: Efficient Matrix-Free Approximations of Second-Order Information, 2021, NeurIPS.

[7] Kyle Chard, et al. KAISA: An Adaptive Second-Order Optimizer Framework for Deep Neural Networks, 2021, SC21: International Conference for High Performance Computing, Networking, Storage and Analysis.

[8] Shaohuai Shi, et al. Accelerating Distributed K-FAC with Smart Parallelism of Computing and Communication Tasks, 2021, 2021 IEEE 41st International Conference on Distributed Computing Systems (ICDCS).

[9] Donald Goldfarb, et al. Tensor Normal Training for Deep Learning Models, 2021, NeurIPS.

[10] Dan Alistarh, et al. Communication-Efficient Distributed Optimization with Quantized Preconditioners, 2021, ICML.

[11] Rio Yokota, et al. Rich Information is Affordable: A Systematic Performance Analysis of Second-order Optimization Using K-FAC, 2020, KDD.

[12] Ian T. Foster, et al. Convolutional Neural Network Training with Distributed K-FAC, 2020, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis.

[13] Donald Goldfarb, et al. Practical Quasi-Newton Methods for Training Deep Neural Networks, 2020, NeurIPS.

[14] Kurt Keutzer, et al. ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning, 2020, AAAI.

[15] Chuan-Sheng Foo, et al. Scalable and Practical Natural Gradient for Large-Scale Deep Learning, 2020, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[16] M. Shoeybi, et al. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, 2019, arXiv.

[17] Quoc V. Le, et al. AutoAugment: Learning Augmentation Strategies From Data, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[18] Seong Joon Oh, et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[19] Satoshi Matsuoka, et al. Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Quoc V. Le, et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism, 2018, NeurIPS.

[21] John C. Duchi, et al. Stochastic (Approximate) Proximal Point Methods: Convergence, Optimality, and Adaptivity, 2018, SIAM J. Optim.

[22] Yoram Singer, et al. Shampoo: Preconditioned Stochastic Tensor Optimization, 2018, ICML.

[23] Alexander Sergeev, et al. Horovod: fast and easy distributed deep learning in TensorFlow, 2018, arXiv.

[24] Frank Hutter, et al. Decoupled Weight Decay Regularization, 2017, ICLR.

[25] Elad Hoffer, et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks, 2017, NIPS.

[26] Jorge Nocedal, et al. Optimization Methods for Large-Scale Machine Learning, 2016, SIAM Rev.

[27] Nikos Komodakis, et al. Wide Residual Networks, 2016, BMVC.

[28] Sergey Ioffe, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, 2016, AAAI.

[29] Roger B. Grosse, et al. A Kronecker-factored approximate Fisher matrix for convolution layers, 2016, ICML.

[30] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Andrea Montanari, et al. Convergence rates of sub-sampled Newton methods, 2015, NIPS.

[32] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.

[33] Ya-Xiang Yuan, et al. Recent advances in trust region algorithms, 2015, Mathematical Programming.

[34] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[35] Andrew Zisserman, et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.

[36] Geoffrey E. Hinton, et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[37] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[38] James Martens. Deep learning via Hessian-free optimization, 2010, ICML.

[39] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[40] Shun-ichi Amari. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[41] John E. Dennis, et al. Numerical methods for unconstrained optimization and nonlinear equations, 1983, Prentice Hall series in computational mathematics.

[42] J. Sherman, et al. Adjustment of an Inverse Matrix Corresponding to a Change in One Element of a Given Matrix, 1950.

[43] S. Shi, et al. Eva: Practical Second-order Optimization with Kronecker-vectorized Approximation, 2023, International Conference on Learning Representations.

[44] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[45] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images, 2009.

[46] Léon Bottou. Online Learning and Stochastic Approximations, 1998.