Training Deep and Recurrent Networks with Hessian-Free Optimization

In this chapter we will first describe the basic Hessian-free (HF) approach, and then examine well-known performance-improving techniques such as preconditioning, which we have found to be beneficial for neural network training, as well as others of a more heuristic nature that are harder to justify but which we have found to work well in practice. We will also provide practical tips for creating efficient and bug-free implementations, and discuss various pitfalls that may arise when designing and using an HF-type approach in a particular application.
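
To make the basic idea concrete before the detailed treatment, the sketch below shows a single Hessian-free (truncated Newton) update on a toy quadratic: the damped Newton system is solved by conjugate gradients using only Hessian-vector products, never the full Hessian. This is an illustrative, assumption-laden sketch and not the chapter's implementation: the function and parameter names (hf_step, damping, cg_iters) are ours, Hessian-vector products are approximated by finite differences of the gradient rather than the exact R-operator or Gauss-Newton products discussed later, the damping constant is fixed rather than adapted, and the inner CG loop runs for a fixed iteration budget.

import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-4):
    # Approximate H v by a central difference of the gradient; an exact
    # alternative would be Pearlmutter's R-operator or a Gauss-Newton product.
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def conjugate_gradient(apply_A, b, max_iters=50, tol=1e-10):
    # Solve A x = b using only matrix-vector products with A.
    x = np.zeros_like(b)
    r = b - apply_A(x)          # residual
    p = r.copy()                # search direction
    rs_old = r @ r
    for _ in range(max_iters):
        Ap = apply_A(p)
        alpha = rs_old / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def hf_step(grad_fn, theta, damping=1.0, cg_iters=50):
    # One damped truncated-Newton update: solve (H + damping * I) d = -g by CG.
    g = grad_fn(theta)
    apply_A = lambda v: hessian_vector_product(grad_fn, theta, v) + damping * v
    d = conjugate_gradient(apply_A, -g, max_iters=cg_iters)
    return theta + d

# Toy usage on the convex quadratic f(x) = 0.5 x^T A x - b^T x,
# whose exact minimizer is A^{-1} b.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b
x = np.zeros(2)
for _ in range(5):
    x = hf_step(grad, x, damping=0.1)
print(x, np.linalg.solve(A, b))  # the two should roughly agree

In a real neural network setting the gradient and curvature products would be computed on minibatches and the damping adapted between updates; the point here is only the structure of the update, namely an inner CG solve driven entirely by matrix-vector products.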
