Paper Information - BDA-PCH: Block-Diagonal Approximation of Positive-Curvature Hessian for Training Neural Networks

BDA-PCH: Block-Diagonal Approximation of Positive-Curvature Hessian for Training Neural Networks

We propose a block-diagonal approximation of the positive-curvature Hessian (BDA-PCH) matrix to measure curvature. The proposed BDA-PCH matrix is memory efficient and can be applied to any fully-connected neural network whose activation and criterion functions are twice differentiable. In particular, the BDA-PCH matrix can handle non-convex criterion functions. We devise an efficient scheme that uses the conjugate gradient method to derive Newton directions in the mini-batch setting. Empirical studies show that our method outperforms competing second-order methods in convergence speed.
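To illustrate why a block-diagonal curvature matrix pairs well with the conjugate gradient method, the sketch below solves one small linear system per block instead of one large system. It is a generic textbook conjugate gradient routine, not the authors' implementation. It assumes each block is already symmetric positive definite (how BDA-PCH builds its positive-curvature blocks is not described in this abstract), and the names `conjugate_gradient`, `block_newton_direction`, and the `damping` parameter are hypothetical.

```python
def matvec(A, v):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(apply_A, b, max_iters=50, tol=1e-10):
    """Approximately solve A x = b for symmetric positive-definite A,
    given only a function v -> A v (no explicit inverse is formed)."""
    x = [0.0] * len(b)
    r = list(b)              # residual b - A x, with x = 0
    p = list(r)              # first search direction
    rs_old = dot(r, r)
    for _ in range(max_iters):
        if rs_old <= tol:    # squared residual norm is small enough
            break
        Ap = apply_A(p)
        alpha = rs_old / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


def block_newton_direction(blocks, grads, damping=1e-3):
    """Solve (B_l + damping * I) d_l = -g_l independently for each block.
    `blocks` are assumed positive definite; `damping` is an assumption."""
    directions = []
    for B, g in zip(blocks, grads):
        def apply_A(v, B=B):
            return [bv + damping * vi for bv, vi in zip(matvec(B, v), v)]
        directions.append(conjugate_gradient(apply_A, [-gi for gi in g]))
    return directions


# Toy example: two "layers" with 2 and 1 parameters respectively.
blocks = [[[4.0, 1.0], [1.0, 3.0]], [[2.0]]]
grads = [[1.0, 2.0], [4.0]]
directions = block_newton_direction(blocks, grads, damping=0.0)
```

Because the blocks are independent, each solve touches only one layer's parameters, which is where the memory saving of a block-diagonal approximation comes from. In a real network the products `B v` would be computed without materializing `B`.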

[1] Roger B. Grosse, et al. A Kronecker-factored approximate Fisher matrix for convolution layers, 2016.

[2] John Wright, et al. Using negative curvature in solving nonlinear programs, 2017, Comput. Optim. Appl.

[3] Boris Polyak, et al. Acceleration of stochastic approximation by averaging, 1992.

[4] James Martens, et al. Deep learning via Hessian-free optimization, 2010, ICML.

[5] Nicol N. Schraudolph, et al. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent, 2002, Neural Computation.

[6] David Barber, et al. Practical Gauss-Newton Optimisation for Deep Learning, 2017, ICML.

[7] Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian, 1994, Neural Computation.

[8] Stuart E. Dreyfus, et al. Second-order stagewise backpropagation for Hessian-matrix analyses and investigation of negative curvature, 2008, Neural Networks.

[9] Surya Ganguli, et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, 2014, NIPS.

[10] Kenji Fukumizu, et al. Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons, 2000, Neural Computation.

[11] Richard Socher, et al. Block-diagonal Hessian-free Optimization for Training Neural Networks, 2017, ArXiv.

[12] Nassir Navab, et al. Robust Optimization for Deep Regression, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[13] Charles V. Stewart, et al. Robust Parameter Estimation in Computer Vision, 1999, SIAM Rev.

[14] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.

[15] Li Fei-Fei, et al. ImageNet: A large-scale hierarchical image database, 2009, CVPR.

[16] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences, 1974.

[17] Klaus-Robert Müller, et al. Efficient BackProp, 2012, Neural Networks: Tricks of the Trade.

[18] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.