Theoretical analysis of batch and on-line training for gradient descent learning in neural networks

In this study, we theoretically analyze two essential training schemes for gradient descent learning in neural networks: batch and on-line training. We analytically investigate the convergence properties of the two schemes when applied to quadratic loss functions. We quantify the convergence of each scheme to the optimal weight using two measures: the absolute value of the expected difference between the optimal weight and the weight computed by the scheme (Measure 1), and the expected squared difference between them (Measure 2). Although on-line training has several advantages over batch training with respect to Measure 1, it does not converge to the optimal weight with respect to Measure 2 if the variance of the per-instance gradient remains constant. However, if this variance decays exponentially, then on-line training converges to the optimal weight with respect to Measure 2 as well. Our analysis reveals the exact degrees to which the training set size, the variance of the per-instance gradient, and the learning rate affect the rate of convergence for each scheme.
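To make the setting concrete, the following is a minimal numerical sketch, not the paper's analysis: it runs batch and on-line gradient descent on a one-dimensional quadratic loss, models each per-instance gradient as the true gradient plus zero-mean noise, and estimates both convergence measures by Monte Carlo simulation. All names and parameter values (`w_opt`, `eta`, `sigma`, `N`, the decay factor) are illustrative assumptions chosen only to mirror the qualitative claims above: the constant-variance case leaves on-line training with a nonzero Measure 2, while an exponentially decaying variance drives it toward zero.

```python
import numpy as np

# Minimal sketch (assumed setting, not the paper's exact model):
# quadratic loss L(w) = 0.5 * (w - w_opt)^2, with each per-instance
# gradient equal to the true gradient plus zero-mean noise of
# standard deviation sigma (per-instance gradient variance sigma^2).

rng = np.random.default_rng(0)

w_opt = 2.0    # optimal weight (hypothetical)
eta = 0.1      # learning rate
N = 50         # training set size
T = 200        # number of epochs
trials = 2000  # Monte Carlo trials used to estimate the expectations


def run(scheme, sigma, decay=1.0):
    """Return the final weights over all trials for 'batch' or 'online'."""
    w = np.zeros(trials)
    s = sigma
    for _ in range(T):
        if scheme == "batch":
            # one update per epoch: average of N noisy per-instance
            # gradients, so the noise variance is s^2 / N
            grad = (w - w_opt) + s * rng.standard_normal(trials) / np.sqrt(N)
            w -= eta * grad
        else:
            # on-line: N single-instance updates per epoch
            for _ in range(N):
                grad = (w - w_opt) + s * rng.standard_normal(trials)
                w -= eta * grad
        s *= decay  # decay=1.0 keeps the variance constant; decay<1 shrinks it
    return w


for scheme, decay in [("batch", 1.0), ("online", 1.0), ("online", 0.9)]:
    w = run(scheme, sigma=1.0, decay=decay)
    m1 = abs(np.mean(w) - w_opt)      # Measure 1: |E[w] - w_opt|
    m2 = np.mean((w - w_opt) ** 2)    # Measure 2: E[(w - w_opt)^2]
    print(f"{scheme:7s} decay={decay:.1f}  Measure1={m1:.4f}  Measure2={m2:.4f}")
```

Under these assumptions, both schemes drive Measure 1 toward zero, but on-line training with constant gradient variance settles at a noise floor in Measure 2 that depends on the learning rate and the variance, whereas the exponentially decaying variance removes that floor.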
