Effiicient BackProp

The convergence of back-propagation learning is analyzed so as to explain common phenomenon observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural net training. It is shown that most “classical” second-order methods are impractical for large neural networks. A few methods are proposed that do not have these limitations.

[1]  I. G. BONNER CLAPPISON Editor , 1960, The Electric Power Engineering Handbook - Five Volume Set.

[2]  Alan V. Oppenheim,et al.  Digital Signal Processing , 1978, IEEE Transactions on Systems, Man, and Cybernetics.

[3]  Yann LeCun PhD thesis: Modeles connexionnistes de l'apprentissage (connectionist learning models) , 1987 .

[4]  Robert A. Jacobs,et al.  Increased rates of convergence through learning rate adaptation , 1987, Neural Networks.

[5]  Alberto L. Sangiovanni-Vincentelli,et al.  Efficient Parallel Learning Algorithms for Neural Networks , 1988, NIPS.

[6]  D. Broomhead,et al.  Radial Basis Functions, Multi-Variable Functional Interpolation and Adaptive Networks , 1988 .

[7]  Yann LeCun,et al.  Generalization and network design strategies , 1989 .

[8]  John Moody,et al.  Fast Learning in Networks of Locally-Tuned Processing Units , 1989, Neural Computation.

[9]  Yann LeCun,et al.  Improving the convergence of back-propagation learning with second-order methods , 1989 .

[10]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[11]  Yann LeCun,et al.  Optimal Brain Damage , 1989, NIPS.

[12]  John E. Moody,et al.  Note on Learning Rate Schedules for Stochastic Optimization , 1990, NIPS.

[13]  Yann LeCun,et al.  Second Order Properties of Error Surfaces , 1990, NIPS.

[14]  Elie Bienenstock,et al.  Neural Networks and the Bias/Variance Dilemma , 1992, Neural Computation.

[15]  Barak A. Pearlmutter,et al.  Automatic Learning Rate Maximization by On-Line Estimation of the Hessian's Eigenvectors , 1992, NIPS 1992.

[16]  Roberto Battiti,et al.  First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method , 1992, Neural Computation.

[17]  Richard S. Sutton,et al.  Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta , 1992, AAAI.

[18]  M. Moller,et al.  Supervised learning on large redundant training sets , 1992, Neural Networks for Signal Processing II Proceedings of the 1992 IEEE Workshop.

[19]  Martin Fodslette Møller,et al.  A scaled conjugate gradient algorithm for fast supervised learning , 1993, Neural Networks.

[20]  Hilbert J. Kappen,et al.  On-line learning processes in artificial neural networks , 1993 .

[21]  Heekuck Oh,et al.  Neural Networks for Pattern Recognition , 1993, Adv. Comput..

[22]  W. Wiegerinck,et al.  Stochastic dynamics of learning with momentum in neural networks , 1994 .

[23]  Barak A. Pearlmutter Fast Exact Multiplication by the Hessian , 1994, Neural Computation.

[24]  Mark J. L. Orr,et al.  Regularization in the Selection of Radial Basis Function Centers , 1995, Neural Computation.

[25]  Saad,et al.  Exact solution for on-line learning in multilayer neural networks. , 1995, Physical review letters.

[26]  Haim Sompolinsky,et al.  On-line Learning of Dichotomies: Algorithms and Learning Curves. , 1995, NIPS 1995.

[27]  Genevieve B. Orr,et al.  Removing Noise in On-Line Search using Adaptive Batch Sizes , 1996, NIPS.

[28]  Andreas Ziehe,et al.  Adaptive On-line Learning in Changing Environments , 1996, NIPS.

[29]  Shun-ichi Amari,et al.  Neural Learning in Structured Parameter Spaces - Natural Riemannian Gradient , 1996, NIPS.

[30]  Shun-ichi Amari,et al.  The Efficiency and the Robustness of Natural Gradient Descent Learning Rule , 1997, NIPS.

[31]  Shun-ichi Amari,et al.  Natural Gradient Works Efficiently in Learning , 1998, Neural Computation.