Practical Gauss-Newton Optimisation for Deep Learning
Aleksandar Botev | Hippolyt Ritter | David Barber
[1] James Martens. Deep learning via Hessian-free optimization, 2010, ICML.
[2] Boris Polyak. Some methods of speeding up the convergence of iteration methods, 1964.
[3] Ilya Sutskever, et al. Learning Recurrent Neural Networks with Hessian-Free Optimization, 2011, ICML.
[4] Geoffrey E. Hinton, et al. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017, ICLR.
[5] Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian, 1994, Neural Computation.
[6] Colin Raffel, et al. Lasagne: First release, 2015.
[7] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.
[8] Robert Mansel Gower, et al. Higher-order reverse automatic differentiation with emphasis on the third-order, 2016, Math. Program.
[9] Benjamin Schrauwen, et al. Factoring Variations in Natural Images with Deep Gaussian Mixture Models, 2014, NIPS.
[10] Geoffrey E. Hinton, et al. Reducing the Dimensionality of Data with Neural Networks, 2006, Science.
[11] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2), 1983.
[12] Geoffrey E. Hinton, et al. On the importance of initialization and momentum in deep learning, 2013, ICML.
[13] Roger B. Grosse, et al. Optimizing Neural Networks with Kronecker-factored Approximate Curvature, 2015, ICML.
[14] Surya Ganguli, et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization, 2014, NIPS.
[15] Nicol N. Schraudolph. Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent, 2002, Neural Computation.
[16] Andy Harter, et al. Parameterisation of a stochastic model for human face identification, 1994, Proceedings of the 1994 IEEE Workshop on Applications of Computer Vision.
[17] Tom Schaul, et al. No more pesky learning rates, 2012, ICML.
[18] John Salvatier, et al. Theano: A Python framework for fast computation of mathematical expressions, 2016, ArXiv.
[19] Marc'Aurelio Ranzato, et al. Learning Factored Representations in a Deep Mixture of Experts, 2013, ICLR.
[20] Shun-ichi Amari. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.
[21] Heiga Zen, et al. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis, 2014, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[22] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method, 2012, ArXiv.
[23] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.