Temporal hierarchies in multilayer gated recurrent neural networks for language models

Representing the multiple levels of composition in human language is difficult because of language's inherently hierarchical and compositional structure. Hierarchical architectures are one way to capture such compositionality. In this paper, we introduce temporal hierarchies into the Neural Language Model (NLM) through a deep Gated Recurrent Neural Network with adaptive timescales, so that different layers can represent different levels of linguistic composition. We demonstrate that representing multiple compositional levels within a deep recurrent architecture improves language-model performance without requiring explicitly hierarchical architectures. We evaluate the proposed model on the widely used Penn Treebank (PTB) dataset and show that applying the multiple-timescale concept to an NLM achieves lower perplexity than the existing baselines.
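
To make the idea concrete, the sketch below shows one plausible way to realize a timescale-adapted GRU cell in PyTorch, following the common multiple-timescale (leaky-integration) formulation: each layer's hidden state is a weighted mix of its previous state and the ordinary GRU update, with the mixing constant set by a timescale tau. This is a minimal sketch under that assumption; the class and parameter names (MTGRUCell, tau) are illustrative, and the paper's exact update rule and the way tau is adapted may differ.

import torch
import torch.nn as nn

class MTGRUCell(nn.Module):
    """GRU cell with a leaky-integration timescale tau (illustrative sketch)."""

    def __init__(self, input_size, hidden_size, tau=1.0):
        super().__init__()
        self.tau = tau  # timescale; tau = 1.0 recovers a standard GRU update
        self.gates = nn.Linear(input_size + hidden_size, 2 * hidden_size)
        self.cand = nn.Linear(input_size + hidden_size, hidden_size)

    def forward(self, x, h_prev):
        zr = torch.sigmoid(self.gates(torch.cat([x, h_prev], dim=-1)))
        z, r = zr.chunk(2, dim=-1)                      # update / reset gates
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h_prev], dim=-1)))
        h_gru = (1.0 - z) * h_prev + z * h_tilde        # ordinary GRU update
        # Leaky integration: larger tau means slower state change, so a layer
        # with a large tau integrates information over longer spans of text.
        return (1.0 - 1.0 / self.tau) * h_prev + (1.0 / self.tau) * h_gru

# Stacking cells with progressively slower timescales forms the temporal
# hierarchy: lower layers track fast (word-level) dynamics, higher layers
# track slower (phrase- or sentence-level) dynamics. The sizes and tau
# values below are illustrative, not those used in the paper.
layers = [MTGRUCell(200, 200, tau=t) for t in (1.0, 4.0, 16.0)]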
