Kronecker-factored Curvature Approximations for Recurrent Neural Networks

Kronecker-factored Approximate Curvature (K-FAC; Martens & Grosse, 2015) is a second-order optimization method that has been shown to give state-of-the-art performance on large-scale neural network optimization tasks (Ba et al., 2017). It is based on an approximation to the Fisher information matrix (FIM) that makes assumptions about the particular structure of the network and the way it is parameterized. The original K-FAC method was applicable only to fully-connected networks, although it has recently been extended by Grosse & Martens (2016) to handle convolutional networks as well. In this work we extend the method to handle RNNs by introducing a novel approximation to the FIM for such networks. This approximation works by modelling the statistical structure between the gradient contributions at different time-steps using a chain-structured linear Gaussian graphical model, summing the various cross-moments, and computing the inverse in closed form. We demonstrate in experiments that our method significantly outperforms general-purpose state-of-the-art optimizers like SGD with momentum and Adam on several challenging RNN training tasks.
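
To make the above description concrete, here is a minimal sketch of the quantity being approximated, written in hypothetical notation that the abstract itself does not fix: W is a weight matrix shared across time-steps, a_t is its input at time-step t, and g_t is the back-propagated derivative of the log-likelihood with respect to the corresponding pre-activation, so that the gradient contribution from step t is g_t a_t^T. Using vec(g a^T) = a ⊗ g and the Kronecker mixed-product identity, and then applying the usual K-FAC-style assumption that activation and derivative statistics are approximately independent, the FIM block for W becomes a sum of Kronecker-factored cross-moments over pairs of time-steps:

\[
\nabla_W = \sum_t g_t a_t^\top, \qquad
F_W = \mathbb{E}\!\left[\operatorname{vec}(\nabla_W)\,\operatorname{vec}(\nabla_W)^\top\right]
    = \sum_{t,t'} \mathbb{E}\!\left[(a_t a_{t'}^\top) \otimes (g_t g_{t'}^\top)\right]
    \approx \sum_{t,t'} A_{t,t'} \otimes G_{t,t'},
\]
where \(A_{t,t'} = \mathbb{E}[a_t a_{t'}^\top]\) and \(G_{t,t'} = \mathbb{E}[g_t g_{t'}^\top]\). The chain-structured linear Gaussian graphical model mentioned above models how these cross-moments depend on the lag between t and t', which is what allows the resulting sum to be inverted in closed form; this sketch stops short of that step, which is specific to the paper.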

[1] Karan Singh, et al. Learning Linear Dynamical Systems via Spectral Filtering, 2017, NIPS.

[2] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[3] Razvan Pascanu, et al. Natural Neural Networks, 2015, NIPS.

[4] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.

[5] Razvan Pascanu, et al. Revisiting Natural Gradient for Deep Networks, 2013, ICLR.

[6] James Martens, et al. New Insights and Perspectives on the Natural Gradient Method, 2014, J. Mach. Learn. Res.

[7] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[8] James Martens, et al. Deep learning via Hessian-free optimization, 2010, ICML.

[9] Wojciech Zaremba, et al. Recurrent Neural Network Regularization, 2014, ArXiv.

[10] Tara N. Sainath, et al. Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling, 2013, IEEE Workshop on Automatic Speech Recognition and Understanding.

[11] Sergey Ioffe, et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015, ICML.

[12] Beatrice Santorini, et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[13] Ilya Sutskever, et al. Training Deep and Recurrent Networks with Hessian-Free Optimization, 2012, Neural Networks: Tricks of the Trade.

[14] Roger B. Grosse, et al. A Kronecker-factored approximate Fisher matrix for convolution layers, 2016, ICML.

[15] Richard Socher, et al. Improving Generalization Performance by Switching from Adam to SGD, 2017, ArXiv.

[16] Yoram Singer, et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.

[17] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[18] Ilya Sutskever, et al. Subword Language Modeling with Neural Networks, 2011.

[19] Fuzhen Zhang. The Schur complement and its applications, 2005.

[20] Tapani Raiko, et al. Pushing Stochastic Gradient towards Second-Order Methods -- Backpropagation Learning with Transformations in Nonlinearities, 2013, ICLR.

[21] Nathan Srebro, et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning, 2017, NIPS.

[22] Razvan Pascanu, et al. Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines, 2013, ICLR.

[23] Nicolas Le Roux, et al. Topmoumoute Online Natural Gradient Algorithm, 2007, NIPS.

[24] Ruslan Salakhutdinov, et al. Scaling up Natural Gradient by Sparsely Factorizing the Inverse Fisher Matrix, 2015, ICML.

[25] C. Van Loan. The ubiquitous Kronecker product, 2000.

[26] Tom Heskes, et al. On Natural Learning and Pruning in Multilayered Perceptrons, 2000, Neural Computation.

[27] Sergio Gomez Colmenarejo, et al. Hybrid computing using a neural network with dynamic external memory, 2016, Nature.

[28] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.

[29] Ilya Sutskever, et al. Learning Recurrent Neural Networks with Hessian-Free Optimization, 2011, ICML.

[30] Nicol N. Schraudolph, et al. Centering Neural Network Gradient Factors, 1996, Neural Networks: Tricks of the Trade.

[31] Roger B. Grosse, et al. Distributed Second-Order Optimization using Kronecker-Factored Approximations, 2016, ICLR.

[32] Deli Zhao, et al. Sparse Coding and Dictionary Learning with Linear Dynamical Systems, 2016, IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Sanjeev Khudanpur, et al. Parallel training of DNNs with Natural Gradient and Parameter Averaging, 2014.

[34] Ryan P. Adams, et al. Early Stopping as Nonparametric Variational Inference, 2015, AISTATS.