A Limitation of Gradient Descent Learning

For decades, gradient descent has been applied to develop learning algorithms for training neural networks (NNs). In this brief, a limitation of applying such algorithms to train an NN with persistent weight noise is revealed. Let $V(\mathbf{w})$ be the performance measure of an ideal NN; $V(\mathbf{w})$ is used to derive the gradient descent learning (GDL) algorithm. With weight noise, the desired performance measure, denoted $\mathcal{J}(\mathbf{w})$, is $E[V(\tilde{\mathbf{w}})\,|\,\mathbf{w}]$, where $\tilde{\mathbf{w}}$ is the noisy weight vector. When GDL is applied to train an NN with weight noise, the actual learning objective is clearly not $V(\mathbf{w})$ but another scalar function $\mathcal{L}(\mathbf{w})$. For decades, there has been a misconception that $\mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w})$ and, hence, that the actual model attained by GDL is the desired model. However, we show that this might not be the case: 1) with persistent additive weight noise, the actual model attained is the desired model, as $\mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w})$; and 2) with persistent multiplicative weight noise, the actual model attained is unlikely to be the desired model, as $\mathcal{L}(\mathbf{w}) \neq \mathcal{J}(\mathbf{w})$. Accordingly, the properties of the attained models are analyzed in comparison with the desired models, and the learning curves are sketched. Simulation results on 1) a simple regression problem and 2) MNIST handwritten digit recognition are presented to support our claims.
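To make the setting concrete, the following minimal sketch (in Python/NumPy; the data, noise levels, and all function names are illustrative assumptions, not the authors' code or experiments) shows GDL on a simple regression model in which every gradient step is evaluated at weights corrupted by persistent additive or multiplicative noise, alongside a Monte Carlo estimate of the desired objective $\mathcal{J}(\mathbf{w}) = E[V(\tilde{\mathbf{w}})\,|\,\mathbf{w}]$ at the model actually attained.

```python
import numpy as np

# Minimal sketch (not the authors' code): gradient descent learning (GDL)
# on linear regression whose weights are corrupted by persistent weight
# noise at every update, i.e. the gradient is evaluated at the noisy
# weights w_tilde rather than at w itself.
#
# Desired objective:  J(w) = E[ V(w_tilde) | w ],  with V(w) = mean squared error.
# GDL actually descends some other function L(w); under additive noise the two
# coincide, under multiplicative noise they generally do not.

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative choices, not from the paper).
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def mse_grad(w):
    """Gradient of V(w) = (1/n) * ||X w - y||^2."""
    return 2.0 / n * X.T @ (X @ w - y)

def gdl_with_weight_noise(noise="multiplicative", s=0.3, lr=0.05, steps=5000):
    """GDL in which each gradient is computed at the noise-corrupted weights."""
    w = np.zeros(d)
    for _ in range(steps):
        if noise == "multiplicative":
            w_tilde = w * (1.0 + s * rng.normal(size=d))
        else:  # additive
            w_tilde = w + s * rng.normal(size=d)
        w -= lr * mse_grad(w_tilde)
    return w

def desired_objective(w, noise="multiplicative", s=0.3, mc=2000):
    """Monte Carlo estimate of J(w) = E[ V(w_tilde) | w ]."""
    vals = []
    for _ in range(mc):
        if noise == "multiplicative":
            w_tilde = w * (1.0 + s * rng.normal(size=d))
        else:
            w_tilde = w + s * rng.normal(size=d)
        vals.append(np.mean((X @ w_tilde - y) ** 2))
    return float(np.mean(vals))

for noise in ("additive", "multiplicative"):
    w_gdl = gdl_with_weight_noise(noise=noise)
    print(noise, "J(w_gdl) =", round(desired_objective(w_gdl, noise=noise), 4))
```

Comparing the attained weights (and their value of $\mathcal{J}$) against a direct minimizer of the Monte Carlo objective is one simple way to observe the additive versus multiplicative discrepancy the brief analyzes.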
