Incremental training of first-order recurrent neural networks to predict a context-sensitive language

In recent years it has been shown that first-order recurrent neural networks trained by gradient descent can learn not only regular but also simple context-free and context-sensitive languages. However, the success rate was generally low and severe instability issues were encountered. The present study examines the hypothesis that a combination of evolutionary hill climbing with incremental learning and a well-balanced training set enables first-order recurrent networks to reliably learn context-free and mildly context-sensitive languages. In particular, we trained the networks to predict symbols in string sequences of the context-sensitive language aⁿbⁿcⁿ, n ≥ 1. Comparative experiments with and without incremental learning indicated that incremental learning can accelerate and facilitate training. Furthermore, incrementally trained networks generally produced monotonic trajectories in hidden unit activation space, while the trajectories of non-incrementally trained networks oscillated. However, the non-incrementally trained networks were more likely to generalise.
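
As a rough illustration of the kind of procedure described above, the sketch below combines a simple Elman-style (first-order) recurrent network, (1+1)-style hill climbing on the weights, and an incremental schedule that starts with short aⁿbⁿcⁿ strings and gradually admits longer ones. The network size, mutation scale, error measure, and depth schedule are illustrative assumptions, not the parameters used in the study.

```python
# Minimal sketch (assumed parameters): a first-order (Elman-style) recurrent
# network trained by random hill climbing to predict the next symbol in
# strings of a^n b^n c^n, with an incremental schedule over string length.
import numpy as np

SYMBOLS = "Sabc"  # 'S' marks the start of each string


def one_hot(ch):
    v = np.zeros(len(SYMBOLS))
    v[SYMBOLS.index(ch)] = 1.0
    return v


def make_sequence(max_n):
    """Concatenate strings a^n b^n c^n for n = 1..max_n, each preceded by 'S'."""
    return "".join("S" + "a" * n + "b" * n + "c" * n for n in range(1, max_n + 1))


class SRN:
    """Simple recurrent (Elman) network with one hidden layer."""

    def __init__(self, n_in=4, n_hid=5, n_out=4, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W_in = rng.normal(0, 0.5, (n_hid, n_in))
        self.W_rec = rng.normal(0, 0.5, (n_hid, n_hid))
        self.W_out = rng.normal(0, 0.5, (n_out, n_hid))

    def error(self, seq):
        """Summed squared next-symbol prediction error over a sequence."""
        h = np.zeros(self.W_rec.shape[0])
        err = 0.0
        for t in range(len(seq) - 1):
            h = np.tanh(self.W_in @ one_hot(seq[t]) + self.W_rec @ h)
            y = 1.0 / (1.0 + np.exp(-self.W_out @ h))
            err += np.sum((one_hot(seq[t + 1]) - y) ** 2)
        return err

    def mutated(self, sigma, rng):
        """Return a copy with Gaussian noise added to all weight matrices."""
        child = SRN()
        for name in ("W_in", "W_rec", "W_out"):
            w = getattr(self, name)
            setattr(child, name, w + rng.normal(0, sigma, w.shape))
        return child


def hill_climb(net, seq, steps, sigma=0.1, rng=None):
    """(1+1)-style hill climbing: keep a mutant only if it lowers the error."""
    rng = rng or np.random.default_rng(1)
    best_err = net.error(seq)
    for _ in range(steps):
        cand = net.mutated(sigma, rng)
        cand_err = cand.error(seq)
        if cand_err <= best_err:
            net, best_err = cand, cand_err
    return net, best_err


# Incremental schedule (assumed): train first on n <= 2, then n <= 4, then n <= 6,
# so the network sees short strings before longer ones.
net = SRN()
for max_n in (2, 4, 6):
    net, err = hill_climb(net, make_sequence(max_n), steps=2000)
    print(f"trained up to n={max_n}: error {err:.3f}")
```

A non-incremental variant of this sketch would simply train on the full-depth sequence from the start, which is the comparison condition discussed in the abstract.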
