BDA-PCH: Block-Diagonal Approximation of Positive-Curvature Hessian for Training Neural Networks

We propose a block-diagonal approximation of the positive-curvature Hessian (BDA-PCH) matrix to measure curvature. The BDA-PCH matrix is memory efficient and can be applied to any fully-connected neural network whose activation and criterion functions are twice differentiable. In particular, the BDA-PCH matrix can handle non-convex criterion functions. We devise an efficient scheme that uses the conjugate gradient method to derive Newton directions in the mini-batch setting. Empirical studies show that our method outperforms competing second-order methods in convergence speed.
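As a rough illustration of the idea, the sketch below solves one damped Newton system per block with the conjugate gradient method. It is a minimal NumPy sketch under stated assumptions, not the paper's implementation: the function names, the `damping` parameter, and the use of explicit dense blocks are all hypothetical, and the positive-curvature step here (taking absolute values of the eigenvalues of each block) is a generic stand-in, since the abstract does not specify how the BDA-PCH matrix is constructed.

```python
import numpy as np


def make_positive_curvature(block, eps=1e-8):
    """Assumed stand-in: replace each eigenvalue of a symmetric block
    by its absolute value (floored at eps) so the block is positive definite."""
    eigvals, eigvecs = np.linalg.eigh(block)
    eigvals = np.maximum(np.abs(eigvals), eps)
    return (eigvecs * eigvals) @ eigvecs.T


def conjugate_gradient(matvec, b, max_iter=50, tol=1e-10):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b.copy()  # residual b - A x, with x = 0
    p = r.copy()  # first search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if rs_old < tol:  # squared residual norm small enough
            break
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x


def block_newton_direction(blocks, grads, damping=1e-3):
    """Solve (H_b + damping * I) d_b = -g_b independently for every block b."""
    directions = []
    for H, g in zip(blocks, grads):
        H_pos = make_positive_curvature(H)
        # Bind H_pos as a default argument so each closure keeps its own block.
        matvec = lambda v, H_pos=H_pos: H_pos @ v + damping * v
        directions.append(conjugate_gradient(matvec, -g))
    return directions
```

Because each block is solved on its own, memory scales with the largest block rather than with the full Hessian. In practice the dense `H_pos @ v` product would likely be replaced by a Hessian-vector product routine, which conjugate gradient supports since it only needs the map from `v` to `A v`.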
