ASDL: A Unified Interface for Gradient Preconditioning in PyTorch

Gradient preconditioning is a key technique for integrating second-order information into gradients, improving and extending gradient-based learning algorithms. In deep learning, stochasticity, nonconvexity, and high dimensionality have given rise to a wide variety of gradient preconditioning methods, which differ in implementation complexity and are inconsistent in performance and feasibility across settings. We propose the Automatic Second-order Differentiation Library (ASDL), an extension library for PyTorch that offers various implementations and a plug-and-play unified interface for gradient preconditioning. ASDL enables the study and structured comparison of a range of gradient preconditioning methods.
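
To make the idea of gradient preconditioning concrete, the sketch below rescales each gradient by a diagonal curvature proxy (accumulated squared gradients, an Adagrad-style approximation of the empirical Fisher diagonal) before the optimizer step. It is a minimal, self-contained illustration in plain PyTorch of the kind of method ASDL unifies; the class and method names (DiagonalPreconditioner, update_curvature, precondition) are hypothetical and do not reflect ASDL's actual interface.

    # Illustrative sketch only: a diagonal gradient preconditioner in plain PyTorch.
    # Not ASDL's API; all names here are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DiagonalPreconditioner:
        """Accumulates squared gradients and rescales grads by 1/(sqrt(curvature) + damping)."""

        def __init__(self, params, damping=1e-3):
            self.params = [p for p in params if p.requires_grad]
            self.damping = damping
            self.curvature = [torch.zeros_like(p) for p in self.params]

        @torch.no_grad()
        def update_curvature(self):
            # Squared gradients as a cheap diagonal curvature proxy.
            for c, p in zip(self.curvature, self.params):
                if p.grad is not None:
                    c.add_(p.grad ** 2)

        @torch.no_grad()
        def precondition(self):
            # Overwrite p.grad with the preconditioned gradient.
            for c, p in zip(self.curvature, self.params):
                if p.grad is not None:
                    p.grad.div_(c.sqrt() + self.damping)

    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    precond = DiagonalPreconditioner(model.parameters())

    x, t = torch.randn(8, 10), torch.randint(0, 2, (8,))
    for _ in range(3):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), t)
        loss.backward()
        precond.update_curvature()   # curvature estimate from current gradients
        precond.precondition()       # transform grads before the optimizer step
        optimizer.step()

More elaborate preconditioners (e.g., K-FAC, Shampoo, or Gauss-Newton-based methods) differ mainly in how the curvature estimate is built, stored, and inverted; abstracting those details behind a single interface is the role the abstract attributes to ASDL.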
