On the importance of sluggish state memory for learning long term dependency

The vanishing gradient problem inherent in Simple Recurrent Networks (SRNs) trained with back-propagation has led to a significant shift towards Long Short-Term Memory (LSTM) networks and Echo State Networks (ESNs), which overcome the problem through second-order error-carousel schemes and alternative learning algorithms, respectively. This paper re-opens the case for SRN-based approaches by considering a variant, the Multi-recurrent Network (MRN). We show that memory units embedded within its architecture can mitigate the vanishing gradient problem by providing variable sensitivity to recent and more historic information through layer- and self-recurrent links with varied weights, forming a so-called sluggish state-based memory. We demonstrate that an MRN, optimised with noise injection, is able to learn the long-term dependency within a complex grammar-induction task, significantly outperforming the SRN, NARX and ESN. Analysis of the networks' internal representations reveals that the MRN's sluggish state-based representations are best able to latch onto critical temporal dependencies spanning variable time delays and to maintain distinct, stable representations of all underlying grammar states. Surprisingly, the ESN was unable to fully learn the dependency problem, suggesting that the major shift towards this class of models may be premature.
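
To make the memory mechanism concrete, the sketch below (Python/NumPy) illustrates one common way such a sluggish state memory can be realised: the previous hidden state is copied into several memory banks, each blending its old value with the new copy under a different self-recurrence weight, and the concatenated banks feed back into the hidden layer. The layer sizes, decay values and variable names are illustrative assumptions, not the exact configuration reported in the paper.

```python
# Minimal sketch of an MRN-style "sluggish" state memory, assuming an
# Elman/Jordan-style network whose previous hidden state is copied into
# several memory banks with different self-recurrence strengths.
# Sizes, decay values and names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hidden, n_out = 7, 12, 7          # e.g. one-hot grammar symbols
decays = [0.0, 0.25, 0.5, 0.75]           # per-bank self-recurrence weights

n_mem = n_hidden * len(decays)            # concatenated memory banks
W_in  = rng.normal(0, 0.1, (n_hidden, n_in))
W_mem = rng.normal(0, 0.1, (n_hidden, n_mem))
W_out = rng.normal(0, 0.1, (n_out, n_hidden))

def step(x, h_prev, banks):
    """One forward step: update the sluggish memory banks, then the hidden layer."""
    # Each bank blends its own previous value with the previous hidden state:
    #   m_i(t) = a_i * m_i(t-1) + (1 - a_i) * h(t-1)
    banks = [a * m + (1.0 - a) * h_prev for a, m in zip(decays, banks)]
    mem = np.concatenate(banks)                      # sluggish state memory
    h = np.tanh(W_in @ x + W_mem @ mem)              # new hidden state
    y = W_out @ h                                    # prediction (logits)
    return y, h, banks

# Usage: run a short random symbol sequence through the untrained network.
h = np.zeros(n_hidden)
banks = [np.zeros(n_hidden) for _ in decays]
for t in range(5):
    x = np.eye(n_in)[rng.integers(n_in)]             # random one-hot symbol
    y, h, banks = step(x, h, banks)
print("final output logits:", np.round(y, 3))
```

Because each bank decays at a different rate, the slowest banks retain a smoothed trace of states seen many steps earlier while the fastest track only the most recent state, which is what gives the architecture its variable sensitivity to recent versus more historic information.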
