ASDL: A Unified Interface for Gradient Preconditioning in PyTorch

Gradient preconditioning is a key technique for integrating second-order information into gradients, improving and extending gradient-based learning algorithms. In deep learning, stochasticity, nonconvexity, and high dimensionality have given rise to a wide variety of gradient preconditioning methods, which differ in implementation complexity and are inconsistent in performance and feasibility across settings. We propose the Automatic Second-order Differentiation Library (ASDL), an extension library for PyTorch that offers various implementations and a plug-and-play unified interface for gradient preconditioning. ASDL enables the study and structured comparison of a range of gradient preconditioning methods.
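
To make the idea of gradient preconditioning concrete, the sketch below rescales each gradient by a diagonal curvature proxy (accumulated squared gradients, an Adagrad-style approximation of the empirical Fisher diagonal) before the optimizer step. It is a minimal, self-contained illustration in plain PyTorch of the kind of method ASDL unifies; the class and method names (DiagonalPreconditioner, update_curvature, precondition) are hypothetical and do not reflect ASDL's actual interface.

    # Illustrative sketch only: a diagonal gradient preconditioner in plain PyTorch.
    # Not ASDL's API; all names here are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DiagonalPreconditioner:
        """Accumulates squared gradients and rescales grads by 1/(sqrt(curvature) + damping)."""

        def __init__(self, params, damping=1e-3):
            self.params = [p for p in params if p.requires_grad]
            self.damping = damping
            self.curvature = [torch.zeros_like(p) for p in self.params]

        @torch.no_grad()
        def update_curvature(self):
            # Squared gradients as a cheap diagonal curvature proxy.
            for c, p in zip(self.curvature, self.params):
                if p.grad is not None:
                    c.add_(p.grad ** 2)

        @torch.no_grad()
        def precondition(self):
            # Overwrite p.grad with the preconditioned gradient.
            for c, p in zip(self.curvature, self.params):
                if p.grad is not None:
                    p.grad.div_(c.sqrt() + self.damping)

    model = nn.Linear(10, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    precond = DiagonalPreconditioner(model.parameters())

    x, t = torch.randn(8, 10), torch.randint(0, 2, (8,))
    for _ in range(3):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), t)
        loss.backward()
        precond.update_curvature()   # curvature estimate from current gradients
        precond.precondition()       # transform grads before the optimizer step
        optimizer.step()

More elaborate preconditioners (e.g., K-FAC, Shampoo, or Gauss-Newton-based methods) differ mainly in how the curvature estimate is built, stored, and inverted; abstracting those details behind a single interface is the role the abstract attributes to ASDL.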
