Hidden Layer Training via Hessian Matrix Information

The output weight optimization-hidden weight optimization (OWO-HWO) algorithm for training the multilayer perceptron alternately updates the output weights and the hidden weights. This layer-by-layer training strategy greatly improves convergence speed. However, in HWO the desired net function change is taken in the gradient direction, which inevitably limits efficiency. In this paper, two improvements to the OWO-HWO algorithm are presented. New desired net functions are proposed for hidden layer training, which use Hessian matrix information rather than gradients alone. A weighted hidden layer error function, which takes saturation into account, is derived directly from the global error function. Both techniques greatly increase training speed. Faster convergence is verified by simulations with remote sensing data sets.

Introduction

The multilayer perceptron (MLP) is widely used in signal processing, remote sensing, and pattern recognition. Since back propagation (BP) was first proposed for MLP training (Werbos 1974), many researchers have attempted to improve its convergence speed. Techniques used to improve convergence include second order information (Battiti 1992, Møller 1993), training the network layer by layer (Wang and Chen 1996, Parisi et al. 1996), avoiding saturation (Yam and Chow 2000, Lee, Chen and Huang 2001), and adapting the learning factor (Magoulas, Vrahatis and Androulakis 1999, Nachtsheim 1994). Algorithms such as Quickprop (Fahlman 1989), conjugate gradient (Fletcher 1987, Kim 2003), and Levenberg-Marquardt (LM) (Hagan and Menhaj 1994, Fletcher 1987) often perform much better than BP. The essential difference among these algorithms is the weight updating strategy: convergence speed varies considerably depending on whether the weights are modified in the gradient direction, a conjugate direction, or the Newton direction. Which direction is best depends on the nature of the application, the computational load, and other factors. Generally, gradient methods perform worst, while the Newton method performs best but requires more computation time.

Chen, Manry, and Chandrasekaran (1999) constructed a batch mode training algorithm called output weight optimization-hidden weight optimization (OWO-HWO). In OWO-HWO, output weights and hidden unit weights are alternately modified to reduce the training error. The algorithm modifies the hidden weights by minimizing the mean squared error (MSE) between a desired and the actual net function, as originally proposed by Scalero and Tepedelenlioglu (1992). Although OWO-HWO greatly increases training speed, it still has room for improvement (Wang and Chen 1996) because it uses the delta function, which is simply gradient information, as the desired net function change. In addition, HWO is equivalent to BP applied to the hidden weights under certain conditions (Chen, Manry and Chandrasekaran 1999).

In this paper, a Newton-like method is used to improve hidden layer training. First, we review OWO-HWO training. Then, we propose new desired hidden layer net function changes that use Hessian matrix information. Next, we derive a weighted hidden layer error function from the global training error function, which de-emphasizes the error of saturated hidden units. Finally, we compare the improved training algorithm with the original OWO-HWO and LM algorithms through simulations on three remote sensing training data sets.
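To make the distinction concrete, the following Python sketch contrasts a desired net-function change taken purely in the gradient direction with a Newton-like change that also uses curvature (second-derivative) information. The function names, the scalar per-unit treatment, and the damping term eps are illustrative assumptions of this sketch, not the paper's exact update equations.

```python
import numpy as np

def desired_change_gradient(g_j, learning_factor=0.1):
    """Gradient-direction desired net-function change (delta-function style)."""
    return -learning_factor * g_j

def desired_change_newton(g_j, h_j, eps=1e-3):
    """Newton-like desired net-function change: gradient scaled by inverse curvature.

    g_j and h_j stand for the first and second derivatives of the error with
    respect to the jth hidden unit's net function; eps guards against division
    by near-zero curvature.  These symbols are illustrative only.
    """
    return -g_j / (np.abs(h_j) + eps)

# The Newton-like change adapts the step to the local curvature, while the
# gradient-direction change is the same regardless of curvature.
g = -0.02
for h in (0.01, 0.1, 1.0):
    print(h, desired_change_gradient(g), desired_change_newton(g, h))
```

The point of the comparison is only that curvature scaling adapts the size of the desired change to the local shape of the error surface, which is the property that Hessian-based desired net functions exploit.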
The OWO-HWO Algorithm

Without loss of generality, we restrict our discussion to a three-layer, fully connected MLP with linear output activation functions. First, we describe the network structure and our notation. Then we review the OWO-HWO algorithm for this MLP.

Fully Connected MLP Notation

The network structure is shown in Fig. 1. For clarity, the bypass weights from the input layer to the output layer are not shown. The training data set consists of Nv training patterns {(xp, tp)}, where the pth input vector xp and the pth desired output vector tp have dimensions N and M, respectively. Thresholds in the hidden and output layers are handled by letting xp,(N+1) = 1. For the jth hidden unit, the net input netpj and the output activation Opj for the pth training pattern are

net_{pj} = \sum_{i=1}^{N+1} w(j,i) \, x_{p,i}, \qquad O_{pj} = f(net_{pj}),

where w(j,i) denotes the weight connecting the ith input to the jth hidden unit and f(·) is the sigmoidal hidden unit activation.

Fig. 1. The network structure.
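For readers who prefer code, here is a minimal NumPy sketch of this forward computation for one training pattern. The array names (W_hid for the hidden weights, W_oh for the hidden-to-output weights, W_oi for the bypass weights) are my own, and the sigmoid choice for f is assumed, so treat it as an illustration of the notation rather than the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x_p, W_hid, W_oh, W_oi):
    """Forward pass for one pattern of the three-layer MLP described above.

    x_p   : input vector of dimension N
    W_hid : (Nh, N+1) hidden weights w(j, i); column N+1 multiplies x_{p,N+1} = 1
    W_oh  : (M, Nh)   output weights from the hidden units
    W_oi  : (M, N+1)  bypass weights from the inputs (and threshold) to the outputs
    """
    x_aug = np.append(x_p, 1.0)      # x_{p,N+1} = 1 handles the thresholds
    net = W_hid @ x_aug              # net_{pj}
    O = sigmoid(net)                 # O_{pj} = f(net_{pj})
    y = W_oh @ O + W_oi @ x_aug      # linear output activations, with bypass weights
    return net, O, y

# Tiny usage example with random weights (N = 3 inputs, Nh = 4 hidden units, M = 2 outputs).
rng = np.random.default_rng(0)
N, Nh, M = 3, 4, 2
net, O, y = forward(rng.normal(size=N),
                    rng.normal(size=(Nh, N + 1)),
                    rng.normal(size=(M, Nh)),
                    rng.normal(size=(M, N + 1)))
print(net.shape, O.shape, y.shape)   # (4,) (4,) (2,)
```

Because the output activations are linear, the output weights enter the MSE quadratically, which is what makes a direct linear least-squares solution for them (the OWO step) possible.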

References

Magoulas, G. D.; Vrahatis, M. N.; and Androulakis, G. S. 1999. Improving the Convergence of the Backpropagation Algorithm Using Learning Rate Adaptation Methods. Neural Computation.

Hagan, M. T., and Menhaj, M. B. 1994. Training Feedforward Networks with the Marquardt Algorithm. IEEE Trans. Neural Networks.

Yam, J. Y. F., and Chow, T. W. S. 2000. A weight initialization method for improving training speed in feedforward neural network. Neurocomputing.

Nachtsheim, P. 1994. A first order adaptive learning rate algorithm for backpropagation networks. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN'94).

Fletcher, R. 1987. Practical Methods of Optimization, 2nd ed. Wiley.

Wang, G.-J., and Chen, C.-C. 1996. A fast multilayer neural-network training algorithm based on the layer-by-layer optimizing procedures. IEEE Trans. Neural Networks.

Lee, H.-M.; Chen, C.-M.; and Huang, T.-C. 1999. Learning efficiency improvement of back propagation algorithm by error saturation prevention method. In Proceedings of IJCNN'99, International Joint Conference on Neural Networks.

Manry, M. T., et al. 1996. A neural network training algorithm utilizing multiple sets of linear equations. In Conference Record of the Thirtieth Asilomar Conference on Signals, Systems and Computers.

Sarabandi, K., et al. 1992. An empirical model and an inversion technique for radar scattering from bare soil surfaces. IEEE Trans. Geoscience and Remote Sensing.

Fung, A. K., et al. 1992. Backscattering from a randomly rough dielectric surface. IEEE Trans. Geoscience and Remote Sensing.

Parisi, R.; Di Claudio, E. D.; Orlandi, G.; and Rao, B. D. 1996. A generalized learning paradigm exploiting the structure of feedforward neural networks. IEEE Trans. Neural Networks.

Werbos, P. J. 1974. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. diss., Harvard University.

Møller, M. F. 1993. Efficient Training of Feed-Forward Neural Networks.

Scalero, R. S., and Tepedelenlioglu, N. 1992. A fast new algorithm for training feedforward neural networks. IEEE Trans. Signal Processing.

Kim, T., et al. 2003. New learning factor and testing methods for conjugate gradient training algorithm. In Proceedings of the International Joint Conference on Neural Networks 2003.

Møller, M. F. 1993. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks.

Battiti, R. 1992. First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method. Neural Computation.