BackPACK: Packing more into backprop

Automatic differentiation frameworks are optimized for exactly one thing: computing the average mini-batch gradient. Yet other quantities, such as the variance of the mini-batch gradients or many approximations to the Hessian, can in theory be computed efficiently and at the same time as the gradient. While these quantities are of great interest to researchers and practitioners, current deep learning software does not support their automatic calculation. Manually implementing them is burdensome, inefficient if done naively, and the resulting code is rarely shared. This hampers progress in deep learning and unnecessarily narrows research to focus on gradient descent and its variants; it also complicates replication studies and comparisons between newly developed methods that require those quantities, to the point of impossibility. To address this problem, we introduce BackPACK, an efficient framework built on top of PyTorch that extends the backpropagation algorithm to extract additional information from first- and second-order derivatives. Its capabilities are illustrated by benchmark reports for computing additional quantities on deep neural networks, and by an example application that tests several recent curvature approximations for optimization.
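As a concrete illustration of the workflow the abstract describes, the sketch below shows how an extended backward pass is typically driven from PyTorch with the backpack-for-pytorch package. The extension classes (BatchGrad, Variance) and the attributes they attach to parameters (grad_batch, variance) follow BackPACK's documented 1.x API, but exact names and supported layers can differ between versions, so read this as an assumed minimal example rather than the authors' reference code.

```python
import torch
from backpack import backpack, extend
from backpack.extensions import BatchGrad, Variance

# Toy data: 8 examples, 10 features, 3 classes.
X = torch.randn(8, 10)
y = torch.randint(0, 3, (8,))

# extend() registers the hooks BackPACK needs on supported modules.
model = extend(torch.nn.Sequential(
    torch.nn.Linear(10, 20),
    torch.nn.ReLU(),
    torch.nn.Linear(20, 3),
))
lossfunc = extend(torch.nn.CrossEntropyLoss())

loss = lossfunc(model(X), y)

# Inside the context manager, backward() also computes the requested quantities.
with backpack(BatchGrad(), Variance()):
    loss.backward()

for name, param in model.named_parameters():
    print(name, param.grad.shape)        # usual averaged mini-batch gradient
    print(name, param.grad_batch.shape)  # per-example gradients: [batch, *param.shape]
    print(name, param.variance.shape)    # element-wise gradient variance: param.shape
```

The point of this pattern is that the individual gradients and their variance come out of the same backward pass that produces param.grad, instead of requiring a separate backward pass per example.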
