Annealed online learning in multilayer neural networks

In this article we will examine online learning with an annealed learning rate. Annealing the learning rate is necessary if online learning is to reach its optimal solution. With a fixed learning rate, the system approximates the best solution only up to some fluctuations, whose size is proportional to the fixed learning rate. It has been shown that an optimal annealing schedule can make online learning asymptotically efficient, meaning that asymptotically it learns as fast as possible. Until now, these results have only been realized in very simple networks, such as single-layer perceptrons (section 3). Even the simplest multilayer network, the committee machine, shows an additional symptom that makes straightforward annealing ineffective: at the beginning of learning the committee machine is attracted by a metastable, suboptimal solution (section 4). The system stays in this metastable solution for a long time and can only leave it if the learning rate is not too small, which delays the start of annealing considerably. Here we will show that a non-local or matrix update can prevent the system from becoming trapped in the metastable phase, allowing annealing to start much earlier (section 5). Some remarks on the influence of the initial conditions and a possible candidate for theoretical support are discussed in section 6. The paper ends with a summary of future tasks and a conclusion.

1 Introduction

One of the most attractive properties of artificial neural networks is their ability to learn from examples and to generalize the acquired knowledge to unknown data. Recently, online learning, as opposed to batch or offline learning, has become very popular. In online learning the weights are updated using only one example x(t) at a time t, i.e.

    W(t + 1) = W(t) + η ΔW[x(t), z*(t), W(t)],   (1.1)

where η is the learning rate and z*(t) is the correct target output in the case of supervised learning. The advantages of online learning are obvious: no memory is needed to store all the examples, and recent examples can be emphasized.
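To make the effect of annealing concrete, the following minimal numerical sketch (not taken from the paper; the linear student, the noise level, and the schedule η(t) = η0 / (1 + t/t0) are illustrative assumptions) performs the online update of Eq. (1.1) for a single linear unit and compares a fixed learning rate with an annealed one. With the fixed rate the distance to the teacher saturates at a fluctuation floor proportional to the learning rate, while with annealing it continues to shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50                                    # input dimension (illustrative)
B = rng.normal(size=N) / np.sqrt(N)       # teacher weights defining the target rule
sigma, steps = 0.1, 20000                 # output noise level, number of examples
eta0, t0 = 0.5, 100.0                     # illustrative learning-rate constants

def train(annealed):
    """One run of online learning following Eq. (1.1), one example per step."""
    W = np.zeros(N)
    for t in range(1, steps + 1):
        x = rng.normal(size=N) / np.sqrt(N)        # single example x(t)
        z_star = B @ x + sigma * rng.normal()      # noisy target output z*(t)
        eta = eta0 / (1.0 + t / t0) if annealed else eta0
        # gradient step Delta W for a linear unit with squared-error loss
        W += eta * (z_star - W @ x) * x
    return np.sum((W - B) ** 2)                    # squared distance to the teacher

print("fixed eta   :", train(annealed=False))   # settles at a floor proportional to eta0
print("annealed eta:", train(annealed=True))    # keeps decreasing toward zero
```

The 1/t-type schedule used here is the classical stochastic-approximation choice; the point of the paper is that in multilayer networks such annealing must not start too early, because a small learning rate prolongs the metastable plateau.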

References

[1] Andreas Engel et al., On-line Learning in Multilayer Networks, 2001.

[2] Magnus Rattray et al., Globally Optimal On-line Learning Rules, NIPS, 1997.

[3] Michael Biehl et al., Transient dynamics of on-line learning in two-layered neural networks, 1996.

[4] Shun-ichi Amari, A Theory of Adaptive Pattern Classifiers, IEEE Trans. Electron. Comput., 1967.

[5] S. Amari, Natural Gradient Works Efficiently in Learning, Neural Computation, 1998.

[6] David Saad et al., On-line learning with adaptive back-propagation in two-layer networks, 1997.

[7] S. Bös, Statistical mechanics approach to early stopping and weight decay, 1998.

[8] Opper, On-line versus Off-line Learning from Random Examples: General Results, Physical Review Letters, 1996.

[9] Nestor Caticha et al., Functional optimization of online algorithms in multilayer neural networks, 1997.

[10] Saad et al., On-line learning in soft committee machines, Physical Review E, 1995.

[11] Howard Hua Yang et al., Natural Gradient Descent for Training Multi-Layer Perceptrons, 1997.

[12] Saad et al., Exact solution for on-line learning in multilayer neural networks, Physical Review Letters, 1995.

[13] Ansgar Heinrich Ludolf West et al., Adaptive Back-Propagation in On-Line Learning of Multilayer Networks, NIPS, 1995.

[14] Michael Biehl et al., Learning drifting concepts with neural networks, 1992.