Smoothing of cost function leads to faster convergence of neural network learning

One of the major problems in supervised learning of neural networks is the inevitable presence of local minima in the cost function $f(W, D)$. These often render classic gradient-descent-based learning algorithms, which compute the weight update at each iteration as $\Delta W(t) = -\eta \cdot \nabla_W f(W, D)$, powerless. In this paper we describe a new strategy to address this problem, which adaptively changes the learning rate and manipulates the gradient estimator simultaneously. The idea is to implicitly convert the local-minima-laden cost function $f(\cdot)$ into a sequence of smoothed versions $\{f_{\beta_t}\}_{t=1}^{T}$ which, controlled by the parameter $\beta_t$, retain little detail at $t = 1$ and progressively more thereafter; learning is then performed on this sequence of functionals. The corresponding smoothed global minima obtained in this way, $\{W_t\}_{t=1}^{T}$, thus progressively approximate $W^*$, the desired global minimum. Experimental results on a nonconvex function minimization problem and a typical neural network learning task are given, and analyses and discussions of some important issues are provided.
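
As a concrete illustration of the strategy described above, the sketch below approximates each smoothed cost $f_{\beta_t}$ by Monte Carlo averaging of the gradient over Gaussian-perturbed weights with scale $\beta_t$, and couples a shrinking $\beta_t$ schedule with a learning rate that shrinks alongside it. The Gaussian smoothing kernel, the `betas` schedule, and the adaptation rule `eta = eta0 * beta / betas[0]` are assumptions made here for illustration only; the abstract does not specify the paper's exact choices.

```python
import numpy as np

def smoothed_gradient(grad_f, W, D, beta, n_samples=10, rng=None):
    """Monte Carlo estimate of the gradient of the smoothed cost
    f_beta(W) = E[ f(W + beta*u, D) ] with u ~ N(0, I).
    Gaussian-convolution smoothing is assumed for illustration."""
    rng = np.random.default_rng() if rng is None else rng
    g = np.zeros_like(W)
    for _ in range(n_samples):
        u = rng.standard_normal(W.shape)
        g += grad_f(W + beta * u, D)
    return g / n_samples

def smoothed_descent(grad_f, W0, D, betas, eta0=0.1):
    """Gradient descent on the sequence of progressively less-smoothed
    costs f_{beta_t}, t = 1..T.  The learning rate is shrunk together
    with beta_t (a hypothetical adaptation rule)."""
    W = np.asarray(W0, dtype=float).copy()
    for beta in betas:
        eta = eta0 * beta / betas[0]      # adaptive learning rate
        g = smoothed_gradient(grad_f, W, D, beta)
        W = W - eta * g                   # Delta W(t) = -eta * grad_W f_beta
    return W
```

A decreasing schedule such as `betas = np.geomspace(1.0, 1e-2, 20)` reproduces the coarse-to-fine behaviour described above: early iterations see a heavily smoothed, nearly featureless cost, while later iterations see something close to the original landscape.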
