An Analysis of Noise in Recurrent Neural Networks: Convergence and Generalization

There has been much interest in applying noise to feedforward neural networks in order to observe its effect on network performance. We extend these results by introducing and analyzing various methods of injecting synaptic noise into dynamically driven recurrent networks during training. We present theoretical results which show that applying a controlled amount of noise during training may improve convergence and generalization performance. In addition, we analyze the effects of various noise parameters (additive versus multiplicative, cumulative versus noncumulative, per time step versus per string) and predict that the best overall performance can be achieved by injecting additive noise at each time step. Noise contributes a second-order gradient term to the error function which can be viewed as an anticipatory agent to aid convergence. This term appears to find promising regions of weight space in the beginning stages of training, when the training error is large, and should improve convergence on error surfaces with local minima. The first-order term can be interpreted as a regularization term that can improve generalization. Specifically, this term can encourage internal representations where the state nodes operate in the saturated regions of the sigmoid discriminant function. While this effect can improve performance on automata inference problems with binary inputs and target outputs, it is unclear what effect it will have on other types of problems. To substantiate these predictions, we present simulations on learning the dual parity grammar from temporal strings for all noise models, and present simulations on learning a randomly generated six-state grammar using the predicted best noise model.
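As a rough illustration only, and not the paper's implementation, the Python/NumPy sketch below shows how the noise variants named above (additive versus multiplicative, cumulative versus noncumulative, per time step versus per string) could be injected into the recurrent weights of a simple first-order recurrent network during a forward pass over one input string. The network architecture, the function name noisy_forward, the weight names W_rec/W_in/W_out, and the noise level sigma are all illustrative assumptions; the gradient step that would follow (backpropagation through the noisy pass) is not shown.

```python
import numpy as np

def noisy_forward(W_rec, W_in, W_out, inputs, sigma=0.05,
                  additive=True, per_step=True, cumulative=False, rng=None):
    """Forward pass of a simple recurrent network with synaptic noise
    injected into the recurrent weights during training.

    Noise variants (following the taxonomy in the abstract):
      additive   : W_rec + noise            (vs. multiplicative: W_rec * (1 + noise))
      per_step   : fresh noise sample every time step (vs. one sample per string)
      cumulative : noise samples accumulate onto the weights over the string
    """
    rng = np.random.default_rng() if rng is None else rng
    h = np.zeros(W_rec.shape[0])                       # initial state
    per_string_noise = rng.normal(0.0, sigma, W_rec.shape)
    accumulated = np.zeros_like(W_rec)

    for x in inputs:                                   # one input vector per time step
        noise = (rng.normal(0.0, sigma, W_rec.shape)
                 if per_step else per_string_noise)
        if cumulative:
            accumulated += noise
            noise = accumulated
        W_eff = W_rec + noise if additive else W_rec * (1.0 + noise)
        h = np.tanh(W_eff @ h + W_in @ x)              # noisy state update

    return 1.0 / (1.0 + np.exp(-(W_out @ h)))          # string-level output (accept prob.)

# Toy usage: score a short binary string, e.g. for a grammar-inference task.
rng = np.random.default_rng(0)
W_rec = rng.normal(size=(4, 4))
W_in = rng.normal(size=(4, 2))
W_out = rng.normal(size=4)
string = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(noisy_forward(W_rec, W_in, W_out, string, additive=True, per_step=True))
```

Under this reading, the configuration the abstract predicts to work best corresponds to additive=True, per_step=True: a fresh zero-mean perturbation of the recurrent weights at every time step, discarded after the string is processed.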
