A Limitation of Gradient Descent Learning

For decades, gradient descent has been applied to develop learning algorithms for training neural networks (NNs). In this brief, a limitation of applying such algorithms to train an NN with persistent weight noise is revealed. Let $V(\mathbf{w})$ be the performance measure of an ideal NN; $V(\mathbf{w})$ is used to derive the gradient descent learning (GDL) algorithm. With weight noise, the desired performance measure, denoted $\mathcal{J}(\mathbf{w})$, is $E[V(\tilde{\mathbf{w}})\,|\,\mathbf{w}]$, where $\tilde{\mathbf{w}}$ is the noisy weight vector. When GDL is applied to train an NN with weight noise, the actual learning objective is clearly not $V(\mathbf{w})$ but another scalar function $\mathcal{L}(\mathbf{w})$. For decades, there has been a misconception that $\mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w})$ and, hence, that the actual model attained by GDL is the desired model. However, we show that this might not be the case: 1) with persistent additive weight noise, the actual model attained is the desired model, as $\mathcal{L}(\mathbf{w}) = \mathcal{J}(\mathbf{w})$; and 2) with persistent multiplicative weight noise, the actual model attained is unlikely to be the desired model, as $\mathcal{L}(\mathbf{w}) \neq \mathcal{J}(\mathbf{w})$. Accordingly, the properties of the attained models are analyzed in comparison with the desired models, and the learning curves are sketched. Simulation results on 1) a simple regression problem and 2) MNIST handwritten digit recognition are presented to support our claims.
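To make the setting concrete, the following minimal sketch (in Python/NumPy; the data, noise levels, and all function names are illustrative assumptions, not the authors' code or experiments) shows GDL on a simple regression model in which every gradient step is evaluated at weights corrupted by persistent additive or multiplicative noise, alongside a Monte Carlo estimate of the desired objective $\mathcal{J}(\mathbf{w}) = E[V(\tilde{\mathbf{w}})\,|\,\mathbf{w}]$ at the model actually attained.

```python
import numpy as np

# Minimal sketch (not the authors' code): gradient descent learning (GDL)
# on linear regression whose weights are corrupted by persistent weight
# noise at every update, i.e. the gradient is evaluated at the noisy
# weights w_tilde rather than at w itself.
#
# Desired objective:  J(w) = E[ V(w_tilde) | w ],  with V(w) = mean squared error.
# GDL actually descends some other function L(w); under additive noise the two
# coincide, under multiplicative noise they generally do not.

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative choices, not from the paper).
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def mse_grad(w):
    """Gradient of V(w) = (1/n) * ||X w - y||^2."""
    return 2.0 / n * X.T @ (X @ w - y)

def gdl_with_weight_noise(noise="multiplicative", s=0.3, lr=0.05, steps=5000):
    """GDL in which each gradient is computed at the noise-corrupted weights."""
    w = np.zeros(d)
    for _ in range(steps):
        if noise == "multiplicative":
            w_tilde = w * (1.0 + s * rng.normal(size=d))
        else:  # additive
            w_tilde = w + s * rng.normal(size=d)
        w -= lr * mse_grad(w_tilde)
    return w

def desired_objective(w, noise="multiplicative", s=0.3, mc=2000):
    """Monte Carlo estimate of J(w) = E[ V(w_tilde) | w ]."""
    vals = []
    for _ in range(mc):
        if noise == "multiplicative":
            w_tilde = w * (1.0 + s * rng.normal(size=d))
        else:
            w_tilde = w + s * rng.normal(size=d)
        vals.append(np.mean((X @ w_tilde - y) ** 2))
    return float(np.mean(vals))

for noise in ("additive", "multiplicative"):
    w_gdl = gdl_with_weight_noise(noise=noise)
    print(noise, "J(w_gdl) =", round(desired_objective(w_gdl, noise=noise), 4))
```

Comparing the attained weights (and their value of $\mathcal{J}$) against a direct minimizer of the Monte Carlo objective is one simple way to observe the additive versus multiplicative discrepancy the brief analyzes.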
