EA-CG: An Approximate Second-Order Method for Training Fully-Connected Neural Networks

For training fully-connected neural networks (FCNNs), we propose a practical approximate second-order method comprising: 1) an approximation of the Hessian matrix and 2) a conjugate gradient (CG) based method. Our proposed approximate Hessian matrix is memory-efficient and can be applied to any FCNN whose activation and criterion functions are twice differentiable. We devise a CG-based method that incorporates a rank-one approximation to derive Newton directions for training FCNNs, which significantly reduces both space and time complexity. This CG-based method can be employed to solve any linear system whose coefficient matrix is Kronecker-factored, symmetric, and positive definite. Empirical studies show the efficacy and efficiency of our proposed method.
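To make the last claim concrete, the sketch below shows the basic idea of running CG on a Kronecker-factored SPD system without ever materializing the full coefficient matrix: matrix-vector products with A ⊗ B use the standard identity (A ⊗ B) vec(X) = vec(B X Aᵀ). This is a minimal illustration, not the paper's EA-CG method (in particular, it omits the rank-one approximation described above); the helper names kron_matvec and cg_solve_kron are illustrative.

```python
import numpy as np

def kron_matvec(A, B, v):
    """Compute (A kron B) @ v without forming the Kronecker product,
    via the identity (A kron B) vec(X) = vec(B @ X @ A.T)."""
    m, n = B.shape[0], A.shape[0]
    X = v.reshape((m, n), order="F")        # undo column-stacking vec
    return (B @ X @ A.T).reshape(-1, order="F")

def cg_solve_kron(A, B, b, tol=1e-10, max_iter=500):
    """Plain conjugate gradient for (A kron B) x = b.
    Assumes A kron B is symmetric positive definite, which holds
    whenever A and B both are."""
    x = np.zeros_like(b)
    r = b.copy()                            # residual for x = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = kron_matvec(A, B, p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Usage: solve a small Kronecker-factored SPD system and verify.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); A = A @ A.T + 4 * np.eye(4)  # SPD factor
B = rng.standard_normal((3, 3)); B = B @ B.T + 3 * np.eye(3)  # SPD factor
b = rng.standard_normal(12)
x = cg_solve_kron(A, B, b)
assert np.allclose(np.kron(A, B) @ x, b, atol=1e-6)
```

The payoff is in the memory footprint: CG only touches the factors A and B, so the n·m × n·m matrix A ⊗ B is never stored, which is what makes Kronecker-factored curvature approximations tractable at network scale.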
