Neural networks with small weights implement finite memory machines

Recent experimental studies indicate that recurrent networks initialized with ‘small’ weights are inherently biased towards finite memory machines [37]. This paper establishes a theoretical counterpart: we prove that recurrent networks with small weights, and more generally with a contractive transition function, can be approximated arbitrarily well by finite memory machines, even on input sequences of unbounded length. Conversely, every finite memory machine can be simulated by a recurrent network with a contractive transition function. Hence initialization with small weights induces an architectural bias into learning with recurrent neural networks. This bias is beneficial from the point of view of statistical learning theory: it emphasizes regions of the weight space where good generalization can be expected. It is well known that standard recurrent neural networks are not distribution-independent learnable in the PAC sense. We prove that recurrent networks whose transition function is contractive with a fixed contraction parameter fulfill the so-called distribution-independent UCED property, and are therefore distribution-independent PAC-learnable, unlike general recurrent networks.
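
The intuition behind the result can be checked numerically: if the state transition of a recurrent network is a contraction (for instance because the recurrent weight matrix has small norm), then hidden states computed from two input sequences that share a common suffix converge exponentially fast in the length of that suffix, so the network behaves, up to any desired precision, like a machine with finite memory. The following Python sketch is purely illustrative and not taken from the paper; the network size, the choice of tanh units, and the contraction parameter are assumptions made only for this demonstration.

```python
# Illustrative sketch (not from the paper): a recurrent network whose
# transition function is a contraction forgets inputs beyond a bounded
# horizon and therefore acts like a finite memory machine.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim, input_dim = 10, 3
contraction = 0.5  # assumed Lipschitz constant of the state transition

# "Small" recurrent weights: rescale W so its spectral norm equals `contraction`.
W = rng.standard_normal((hidden_dim, hidden_dim))
W *= contraction / np.linalg.norm(W, 2)
V = rng.standard_normal((hidden_dim, input_dim))

def run(seq):
    """Iterate h_t = tanh(W h_{t-1} + V x_t) over the rows of seq.
    tanh is 1-Lipschitz, so the state map contracts by `contraction` per step."""
    h = np.zeros(hidden_dim)
    for x in seq:
        h = np.tanh(W @ h + V @ x)
    return h

# Two long sequences that differ everywhere except in their last k inputs.
T, k = 200, 15
suffix = rng.standard_normal((k, input_dim))
seq_a = np.vstack([rng.standard_normal((T - k, input_dim)), suffix])
seq_b = np.vstack([rng.standard_normal((T - k, input_dim)), suffix])

# The final states are exponentially close: the output effectively depends
# only on a bounded window of recent inputs, up to a small error.
gap = np.linalg.norm(run(seq_a) - run(seq_b))
print(f"state gap after a shared suffix of length {k}: {gap:.2e}")
print(f"contraction bound ~ {contraction}**{k} = {contraction**k:.2e} (times a constant)")
```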

[1] Peter L. Bartlett, et al. Neural Network Learning - Theoretical Foundations, 1999.

[2] Mathukumalli Vidyasagar, et al. A Theory of Learning and Generalization, 1997.

[3] Barbara Hammer, et al. Generalization Ability of Folding Networks, 2001, IEEE Trans. Knowl. Data Eng.

[4] Eduardo D. Sontag, et al. Vapnik-Chervonenkis Dimension of Recurrent Neural Networks, 1997, Discret. Appl. Math.

[5] P. Bühlmann, et al. Variable length Markov chains, 1999.

[6] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[7] Giovanni Soda, et al. Bidirectional Dynamics for Protein Secondary Structure Prediction, 2001, Sequence Learning.

[8] Eduardo Sontag, et al. Computational power of neural networks, 1995.

[9] Terrence J. Sejnowski, et al. Parallel Networks that Learn to Pronounce English Text, 1987, Complex Syst.

[10] Marek Karpinski, et al. Polynomial bounds for VC dimension of sigmoidal neural networks, 1995, STOC '95.

[11] A. Nadas, et al. Estimation of probabilities in the language model of the IBM speech recognition system, 1984.

[12] John Shawe-Taylor, et al. Structural Risk Minimization Over Data-Dependent Hierarchies, 1998, IEEE Trans. Inf. Theory.

[13] Dana Ron, et al. The Power of Amnesia, 1993, NIPS.

[14] Kurt Hornik, et al. Some new results on neural network approximation, 1993, Neural Networks.

[15] Ronald Saul, et al. Discrete Sequence Prediction and Its Applications, 2004, Machine Learning.

[16] Ron Sun, et al. Introduction to Sequence Learning, 2001, Sequence Learning.

[17] Teuvo Kohonen, et al. Self-Organizing Maps, 2010.

[18] Barbara Hammer. Generalization of Elman Networks, 1997, ICANN.

[19] Eduardo Sontag. VC dimension of neural networks, 1998.

[20] Ah Chung Tsoi, et al. Rule inference for financial prediction using recurrent neural networks, 1997, Proceedings of the IEEE/IAFE 1997 Computational Intelligence for Financial Engineering (CIFEr).

[21] Eduardo D. Sontag, et al. Vapnik-Chervonenkis Dimension of Recurrent Neural Networks, 1998, Discret. Appl. Math.

[22] Peter Tiño, et al. Predicting the Future of Discrete Sequences from Fractal Representations of the Past, 2001, Machine Learning.

[23] David Haussler, et al. What Size Net Gives Valid Generalization?, 1989, Neural Computation.

[24] Yoshua Bengio, et al. Learning long-term dependencies with gradient descent is difficult, 1994, IEEE Trans. Neural Networks.

[25] Hava T. Siegelmann, et al. Analog computation via neural networks, 1993, The 2nd Israel Symposium on Theory and Computing Systems.

[26] Kurt Hornik, et al. Multilayer feedforward networks are universal approximators, 1989, Neural Networks.

[27] Isabelle Guyon, et al. Design of a linguistic postprocessor using variable memory length Markov models, 1995, Proceedings of 3rd International Conference on Document Analysis and Recognition.

[28] Yoshua Bengio, et al. Input-output HMMs for sequence processing, 1996, IEEE Trans. Neural Networks.

[29] J. Elman, et al. Rethinking Innateness: A Connectionist Perspective on Development, 1996.

[30] Hava T. Siegelmann, et al. On the Computational Power of Neural Nets, 1995, J. Comput. Syst. Sci.

[31] Barbara Hammer, et al. On the Learnability of Recursive Data, 1999, Math. Control. Signals Syst.

[32] Geoffrey E. Hinton, et al. A View of the EM Algorithm that Justifies Incremental, Sparse, and other Variants, 1998, Learning in Graphical Models.

[33] Philip M. Long, et al. Fat-shattering and the learnability of real-valued functions, 1994, COLT '94.

[34] Peter Tiño. Markovian Architectural Bias of Recurrent Neural Networks, 2004.

[35] Yuichi Nakamura, et al. Approximation of dynamical systems by continuous time recurrent neural networks, 1993, Neural Networks.

[36] Peter Tiño, et al. Markovian architectural bias of recurrent neural networks, 2004, IEEE Transactions on Neural Networks.

[37] Anders Krogh, et al. Two Methods for Improving Performance of a HMM and their Application for Gene Finding, 1997, ISMB.

[38] Eduardo D. Sontag, et al. Feedforward Nets for Interpolation and Classification, 1992, J. Comput. Syst. Sci.

[39] J. Kolen. Recurrent Networks: State Machines Or Iterated Function Systems?, 1994.

[40] Nick Chater, et al. Toward a connectionist model of recursion in human linguistic performance, 1999.