Recurrent Neural Networks with Small Weights Implement Definite Memory Machines

Recent experimental studies indicate that recurrent neural networks initialized with small weights are inherently biased toward definite memory machines (Tiňo, Čerňanský, & Beňušková, 2002a, 2002b). This article establishes a theoretical counterpart: the transition function of a recurrent network with small weights and a squashing activation function is a contraction. We prove that recurrent networks with a contractive transition function can be approximated arbitrarily well, on input sequences of unbounded length, by a definite memory machine. Conversely, every definite memory machine can be simulated by a recurrent network with a contractive transition function. Hence, initialization with small weights induces an architectural bias into learning with recurrent neural networks. This bias might have benefits from the point of view of statistical learning theory: it emphasizes one possible region of the weight space where generalization ability can be formally proved. It is well known that standard recurrent neural networks are not distribution-independent learnable in the probably approximately correct (PAC) sense if arbitrary precision and inputs are considered. We prove that recurrent networks with a contractive transition function with a fixed contraction parameter fulfill the so-called distribution-independent uniform convergence of empirical distances property and hence, unlike general recurrent networks, are distribution-independent PAC learnable.
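As a minimal sketch of the contraction argument (assuming, for illustration, a tanh transition function with recurrent weight matrix $W$, input weight matrix $V$, and bias $b$; these symbols and the specific tanh form are not taken from the abstract itself, which covers general squashing activations):

$$
f(x, u) = \tanh(Wx + Vu + b), \qquad
\|f(x, u) - f(x', u)\| \;\le\; \|W(x - x')\| \;\le\; \|W\|\,\|x - x'\|,
$$

since tanh is 1-Lipschitz in each component. Hence, whenever the recurrent weights are small enough that $\|W\| < 1$, the transition function is a contraction in the state argument with contraction parameter $\|W\|$, which is the small-weight regime the article analyzes.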

[1]  L. Glass,et al.  Oscillation and chaos in physiological control systems , 1977, Science.

[2]  Eduardo D. Sontag,et al.  Feedforward Nets for Interpolation and Classification , 1992, J. Comput. Syst. Sci..

[3]  Barbara Hammer,et al.  On the Learnability of Recursive Data , 1999, Math. Control. Signals Syst..

[4]  M. Raijmakers Rethinking innateness: A connectionist perspective on development. , 1997 .

[5]  Peter Tiño,et al.  Architectural Bias in Recurrent Neural Networks - Fractal Analysis , 2002, ICANN.

[6]  Mikel L. Forcada,et al.  Simple Strategies to Encode Tree Automata in Sigmoid Recursive Neural Networks , 2001, IEEE Trans. Knowl. Data Eng..

[7]  Dana Ron,et al.  The power of amnesia: Learning probabilistic automata with variable memory length , 1996, Machine Learning.

[8]  Nick Chater,et al.  Toward a connectionist model of recursion in human linguistic performance , 1999 .

[9]  David Haussler,et al.  Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications , 1992, Inf. Comput..

[10]  Teuvo Kohonen,et al.  Self-Organizing Maps , 2010 .

[11]  Ah Chung Tsoi,et al.  Rule inference for financial prediction using recurrent neural networks , 1997, Proceedings of the IEEE/IAFE 1997 Computational Intelligence for Financial Engineering (CIFEr).

[12]  Eduardo D. Sontag,et al.  Vapnik-Chervonenkis Dimension of Recurrent Neural Networks , 1998, Discret. Appl. Math..

[13]  A. Nadas,et al.  Estimation of probabilities in the language model of the IBM speech recognition system , 1984 .

[14]  John Shawe-Taylor,et al.  Structural Risk Minimization Over Data-Dependent Hierarchies , 1998, IEEE Trans. Inf. Theory.

[15]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[16]  Peter Tiño,et al.  Learning and Extracting Initial Mealy Automata with a Modular Neural Network Model , 1995, Neural Comput..

[17]  Peter L. Bartlett,et al.  Learning in Neural Networks: Theoretical Foundations , 1999 .

[18]  Geoffrey E. Hinton,et al.  A View of the Em Algorithm that Justifies Incremental, Sparse, and other Variants , 1998, Learning in Graphical Models.

[19]  Philip M. Long,et al.  Fat-shattering and the learnability of real-valued functions , 1994, COLT '94.

[20]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[21]  Yuichi Nakamura,et al.  Approximation of dynamical systems by continuous time recurrent neural networks , 1993, Neural Networks.

[22]  Anders Krogh,et al.  Two Methods for Improving Performance of a HMM and their Application for Gene Finding , 1997, ISMB.

[23]  C. Lee Giles,et al.  Constructing deterministic finite-state automata in recurrent neural networks , 1996, JACM.

[24]  Giovanni Soda,et al.  Bidirectional Dynamics for Protein Secondary Structure Prediction , 2001, Sequence Learning.

[25]  Marek Karpinski,et al.  Polynomial bounds for VC dimension of sigmoidal neural networks , 1995, STOC '95.

[26]  Hava T. Siegelmann,et al.  Analog computation via neural networks , 1993, [1993] The 2nd Israel Symposium on Theory and Computing Systems.

[27]  C. Lee Giles,et al.  Stable Encoding of Large Finite-State Automata in Recurrent Neural Networks with Sigmoid Discriminants , 1996, Neural Computation.

[28]  Peter L. Bartlett,et al.  Neural Network Learning - Theoretical Foundations , 1999 .

[29]  Mathukumalli Vidyasagar,et al.  A Theory of Learning and Generalization , 1997 .

[30]  Giovanni Soda,et al.  Unified Integration of Explicit Knowledge and Learning by Example in Recurrent Networks , 1995, IEEE Trans. Knowl. Data Eng..

[31]  Barbara Hammer,et al.  Generalization Ability of Folding Networks , 2001, IEEE Trans. Knowl. Data Eng..

[32]  Ron Sun,et al.  Introduction to Sequence Learning , 2001, Sequence Learning.

[33]  J. Kolen Recurrent Networks: State Machines Or Iterated Function Systems? , 1994 .

[34]  Yoshua Bengio,et al.  Input-output HMMs for sequence processing , 1996, IEEE Trans. Neural Networks.

[35]  Eduardo D. Sontag,et al.  Analog Neural Nets with Gaussian or Other Common Noise Distributions Cannot Recognize Arbitrary Regular Languages , 1999, Neural Computation.

[36]  P. Bühlmann,et al.  Variable Length Markov Chains: Methodology, Computing, and Software , 2004 .

[37]  Hava T. Siegelmann,et al.  On the Computational Power of Neural Nets , 1995, J. Comput. Syst. Sci..

[38]  Garrison W. Cottrell,et al.  Time-delay neural networks: representation and induction of finite-state machines , 1997, IEEE Trans. Neural Networks.

[39]  Wulfram Gerstner,et al.  Artificial Neural Networks — ICANN'97 , 1997, Lecture Notes in Computer Science.

[40]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[41]  Eduardo Sontag,et al.  Computational power of neural networks , 1995 .

[42]  C. Lee Giles,et al.  Learning a class of large finite state machines with a recurrent neural network , 1995, Neural Networks.

[43]  Peter Tiño Markovian Architectural Bias of Recurrent Neural Networks , 2004 .

[44]  Kurt Hornik,et al.  Some new results on neural network approximation , 1993, Neural Networks.

[45]  Dana Ron,et al.  The Power of Amnesia , 1993, NIPS.

[46]  Mathukumalli Vidyasagar,et al.  A Theory of Learning and Generalization: With Applications to Neural Networks and Control Systems , 1997 .

[47]  Isabelle Guyon,et al.  Design of a linguistic postprocessor using variable memory length Markov models , 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[48]  Alessio Micheli,et al.  Recursive self-organizing network models , 2004, Neural Networks.

[49]  Peter L. Bartlett,et al.  For Valid Generalization the Size of the Weights is More Important than the Size of the Network , 1996, NIPS.

[50]  Barbara Hammer Generalization of Elman Networks , 1997, ICANN.

[51]  Eduardo Sontag VC dimension of neural networks , 1998 .

[52]  Peter Tiño,et al.  Predicting the Future of Discrete Sequences from Fractal Representations of the Past , 2001, Machine Learning.

[53]  David Haussler,et al.  What Size Net Gives Valid Generalization? , 1989, Neural Computation.

[54]  Kurt Hornik,et al.  Multilayer feedforward networks are universal approximators , 1989, Neural Networks.

[55]  John Hawkins,et al.  Improved access to sequential motifs: a note on the architectural bias of recurrent networks , 2005, IEEE Transactions on Neural Networks.

[56]  Ronald Saul,et al.  Discrete sequence prediction and its applications , 2005, Machine Learning.

[57]  Terrence J. Sejnowski,et al.  Parallel Networks that Learn to Pronounce English Text , 1987, Complex Syst..

[58]  Pekka Orponen,et al.  On the Effect of Analog Noise in Discrete-Time Analog Computations , 1996, Neural Computation.

[59]  Henrik Jacobsson,et al.  Rule Extraction from Recurrent Neural Networks: A Taxonomy and Review , 2005, Neural Computation.