Learning Simpler Language Models with the Differential State Framework

Learning useful information across long time lags is a critical and difficult problem for temporal neural models in tasks such as language modeling. Existing architectures that address this issue are often complex and costly to train. The differential state framework (DSF) is a simple, high-performing design that unifies previously introduced gated neural models. DSF models maintain longer-term memory by learning to interpolate between a fast-changing, data-driven representation and a slowly changing, implicitly stable state. Within the DSF, a new architecture is presented, the delta-RNN, which requires hardly any more parameters than a classical simple recurrent network. In language modeling at the word and character levels, the delta-RNN outperforms popular complex architectures such as the long short-term memory (LSTM) and the gated recurrent unit (GRU) and, when regularized, performs comparably to several state-of-the-art baselines. At the subword level, the delta-RNN's performance is comparable to that of complex gated architectures.
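To make the interpolation mechanism concrete, the following is a minimal NumPy sketch of a delta-style recurrent cell. It illustrates only the high-level idea stated above (mixing a fast, data-driven candidate state with the slowly changing previous state through a learned gate); the variable names, the specific gating form, and the initialization are illustrative assumptions, not the paper's exact delta-RNN parameterization.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DeltaStyleCell:
    """Recurrent cell that interpolates between a fast, data-driven
    candidate state and the slowly changing previous hidden state.
    Illustrative sketch only; not the authors' exact equations."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(hidden_dim)
        self.W = rng.normal(0.0, scale, (hidden_dim, input_dim))   # input-to-hidden weights
        self.V = rng.normal(0.0, scale, (hidden_dim, hidden_dim))  # recurrent weights
        self.b = np.zeros(hidden_dim)
        self.b_gate = np.zeros(hidden_dim)  # bias of the interpolation gate

    def step(self, x_t, h_prev):
        # Fast-changing, data-driven candidate representation.
        z_t = np.tanh(self.W @ x_t + self.V @ h_prev + self.b)
        # Data-dependent gate controlling how much of the stable state is kept.
        g_t = sigmoid(self.W @ x_t + self.b_gate)
        # Interpolate: g_t near 1 preserves the slowly changing state,
        # g_t near 0 overwrites it with the new candidate.
        return g_t * h_prev + (1.0 - g_t) * z_t

# Example: run the cell over a short random sequence.
cell = DeltaStyleCell(input_dim=8, hidden_dim=16)
h = np.zeros(16)
for x_t in np.random.default_rng(1).normal(size=(5, 8)):
    h = cell.step(x_t, h)

In a full language model, the hidden state h would feed a softmax over the vocabulary at each step, and all weights would be trained by backpropagation through time.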
