Iterative Scaled Trust-Region Learning in Krylov Subspaces via Pearlmutter's Implicit Sparse Hessian-Vector Multiply

The online incremental gradient (or backpropagation) algorithm is widely considered to be the fastest method for solving large-scale neural-network (NN) learning problems. In contrast, we show that an appropriately implemented iterative batch-mode (or block-mode) learning method can be much faster. For example, it is three times faster on the UCI letter classification problem (26 outputs, 16,000 data items, 6,066 parameters with a two-hidden-layer multilayer perceptron) and 353 times faster on a nonlinear regression problem arising in color recipe prediction (10 outputs, 1,000 data items, 2,210 parameters with a neuro-fuzzy modular network). The three principal innovative ingredients in our algorithm are the following. First, we use scaled trust-region regularization with inner-outer iteration to solve the associated "overdetermined" nonlinear least squares problem, where the inner iteration performs a truncated (or inexact) Newton method. Second, we employ Pearlmutter's implicit sparse Hessian matrix-vector multiply algorithm to construct the Krylov subspaces used to solve for the truncated Newton update. Third, we exploit sparsity (for preconditioning) in the matrices that arise when the NN has many outputs.
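To make the inner iteration concrete, here is a minimal JAX sketch of the two core pieces named above: a Pearlmutter-style exact Hessian-vector product (obtained via forward-over-reverse automatic differentiation, which computes the same quantity as Pearlmutter's R-operator without forming the Hessian) and a Steihaug-Toint-style truncated CG solve of the trust-region subproblem, as in reference [7]. The function and parameter names (`hvp`, `steihaug_cg`, `to_boundary`, `delta`, `tol`) and the toy least-squares problem are illustrative assumptions, not taken from the paper; the sketch also omits the paper's scaled trust-region radius update, the outer iteration, and the sparsity-exploiting preconditioner.

```python
import jax
import jax.numpy as jnp

def hvp(loss, params, v):
    # Exact Hessian-vector product H(params) @ v via forward-over-reverse AD;
    # equivalent to Pearlmutter's R-operator, no explicit Hessian is formed.
    return jax.jvp(jax.grad(loss), (params,), (v,))[1]

def to_boundary(p, d, delta):
    # Positive tau such that ||p + tau * d|| = delta (trust-region boundary).
    a = jnp.dot(d, d)
    b = 2.0 * jnp.dot(p, d)
    c = jnp.dot(p, p) - delta ** 2
    tau = (-b + jnp.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return p + tau * d

def steihaug_cg(hvp_fn, g, delta, tol=1e-6, max_iter=100):
    # Truncated (inexact) Newton step: approximately minimize
    #   m(p) = g^T p + 0.5 p^T H p   subject to ||p|| <= delta
    # by conjugate gradients, stopping at negative curvature or the boundary.
    p = jnp.zeros_like(g)
    r = g                      # residual of H p + g = 0
    d = -r
    rs_old = jnp.dot(r, r)
    if jnp.sqrt(rs_old) < tol:
        return p
    for _ in range(max_iter):
        Hd = hvp_fn(d)
        dHd = jnp.dot(d, Hd)
        if dHd <= 0.0:         # negative curvature: follow d to the boundary
            return to_boundary(p, d, delta)
        alpha = rs_old / dHd
        p_next = p + alpha * d
        if jnp.linalg.norm(p_next) >= delta:   # step leaves the trust region
            return to_boundary(p, d, delta)
        r = r + alpha * Hd
        rs_new = jnp.dot(r, r)
        if jnp.sqrt(rs_new) < tol:
            return p_next
        d = -r + (rs_new / rs_old) * d
        p, rs_old = p_next, rs_new
    return p

# Tiny synthetic nonlinear least-squares problem (purely illustrative).
key = jax.random.PRNGKey(0)
X = jax.random.normal(key, (50, 10))
y = jnp.sin(X @ jnp.ones(10))

def loss(w):
    return 0.5 * jnp.sum((jnp.tanh(X @ w) - y) ** 2)

w = jnp.zeros(10)
g = jax.grad(loss)(w)
step = steihaug_cg(lambda v: hvp(loss, w, v), g, delta=1.0)
print("model reduction:", float(g @ step + 0.5 * step @ hvp(loss, w, step)))
```

The point of the pairing is cost: each Hessian-vector product costs only a small constant multiple of a gradient evaluation, so building the Krylov subspace for the truncated Newton update stays affordable even with thousands of parameters, and the trust-region constraint keeps the step safe when the curvature is indefinite.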

[1] Jorge J. Moré, et al. Computing a Trust Region Step, 1983.

[2] Nicholas I. M. Gould, et al. Trust Region Methods, 2000, MOS-SIAM Series on Optimization.

[3] Trevor Hastie, et al. The Elements of Statistical Learning, 2001.

[4] John E. Dennis, et al. An Adaptive Nonlinear Least-Squares Algorithm, 1977, TOMS.

[5] James Demmel, et al. Applied Numerical Linear Algebra, 1997.

[6] Shun-ichi Amari, et al. Natural Gradient Works Efficiently in Learning, 1998, Neural Computation.

[7] T. Steihaug. The Conjugate Gradient Method and Trust Regions in Large Scale Optimization, 1983.

[8] Stuart E. Dreyfus, et al. On complexity analysis of supervised MLP-learning for algorithmic comparisons, 2001, Proceedings of the International Joint Conference on Neural Networks (IJCNN'01).

[9] E. Mizutani, et al. On separable nonlinear least squares algorithms for neuro-fuzzy modular network learning, 2002, Proceedings of the International Joint Conference on Neural Networks (IJCNN'02).

[10] Geoffrey E. Hinton, et al. Adaptive Mixtures of Local Experts, 1991, Neural Computation.

[11] James Demmel, et al. On structure-exploiting trust-region regularized nonlinear least squares algorithms for neural-network learning, 2003, Neural Networks.

[13] Barak A. Pearlmutter. Fast Exact Multiplication by the Hessian, 1994, Neural Computation.

[14] Yoshua Bengio, et al. Boosting Neural Networks, 2000, Neural Computation.