Iterative weighted least squares algorithms for neural network classifiers

This paper discusses learning algorithms for layered neural networks from the standpoint of maximum likelihood estimation. The Fisher information is calculated explicitly for a network with only one neuron; it can be interpreted as a weighted covariance matrix of the input vectors. A learning algorithm based on Fisher's scoring method is presented, and it is shown that this algorithm can be interpreted as an iterated weighted least-squares method. These results are then extended to a layered network with one hidden layer, for which the Fisher information is again given as a weighted covariance matrix of the inputs and the outputs of the hidden units. Two new algorithms are proposed that exploit this structure, and it is shown experimentally that they converge in fewer iterations than the usual back-propagation (BP) algorithm. In particular, the UFS (unitwise Fisher's scoring) method reduces to an algorithm in which each unit estimates its own weights by a weighted least-squares method.
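As a concrete illustration of the single-neuron case described above, the following Python sketch implements Fisher's scoring for one sigmoid neuron under a binary cross-entropy likelihood. It is not taken from the paper; the names fisher_scoring, ridge, and the toy data are illustrative assumptions. The Fisher information shows up as the weighted covariance X^T V X of the input vectors, and each scoring step is exactly one weighted least-squares solve, i.e. one iteration of iteratively reweighted least squares (IRLS).

import numpy as np

def sigmoid(a):
    # Clip to avoid overflow in exp for large |a|.
    return 1.0 / (1.0 + np.exp(-np.clip(a, -30.0, 30.0)))

def fisher_scoring(X, y, n_iter=20, ridge=1e-8):
    # Fit w of p(y=1|x) = sigmoid(w.x) by Fisher's scoring.
    # Each step solves the weighted least-squares system
    #     (X^T V X) w = X^T V z,   V = diag(p(1-p)),
    # with working response z = X w + V^{-1}(y - p), so one
    # scoring step is one weighted least-squares solve.
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ w)
        v = p * (1.0 - p)                    # per-sample weights p(1-p)
        fisher = X.T @ (v[:, None] * X)      # weighted covariance of inputs
        z = X @ w + (y - p) / np.maximum(v, ridge)   # working response
        w = np.linalg.solve(fisher + ridge * np.eye(d), X.T @ (v * z))
    return w

# Toy usage: noisy 2-D labels, with a bias column appended to X.
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(float)
print("learned weights:", fisher_scoring(X, y))

The ridge term is a small assumed regularizer that keeps the solve well conditioned when p(1-p) approaches zero; the update w + (X^T V X)^{-1} X^T (y - p) recovered by this weighted least-squares system is the standard Fisher-scoring step for a logistic unit.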
