Character-level language modeling with hierarchical recurrent neural networks

Recurrent neural network (RNN) based character-level language models (CLMs) are inherently well suited to modeling out-of-vocabulary words. However, their performance is generally much worse than that of word-level language models (WLMs), since CLMs must consider a longer history of tokens to properly predict the next one. We address this problem by proposing hierarchical RNN architectures that consist of multiple modules operating at different timescales. Despite the multi-timescale structure, the input and output layers operate on the character-level clock, so existing RNN CLM training approaches can be applied directly without modification. Our CLMs achieve better perplexity than Kneser-Ney (KN) 5-gram WLMs on the One Billion Word Benchmark with only 2% of the parameters. We also present real-time character-level end-to-end speech recognition results on the Wall Street Journal (WSJ) corpus, where replacing traditional mono-clock RNN CLMs with the proposed models improves recognition accuracy even though the number of parameters is reduced to 30%.
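To make the multi-timescale idea concrete, the Python (PyTorch) sketch below shows a two-module character-level LM in which a slow "word-level" LSTM cell updates its state only at word boundaries while the fast character-level cell ticks at every character step. This is a minimal illustration under stated assumptions: the whitespace boundary rule, layer sizes, and wiring are our own illustrative choices, not the paper's exact hierarchical configuration.

import torch
import torch.nn as nn

class HierarchicalCharLM(nn.Module):
    """Two-timescale character-level LM sketch (illustrative, not the paper's exact model)."""
    def __init__(self, vocab_size, char_hidden=256, word_hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_hidden)
        # Fast module: runs at every character step and also sees the slow module's state.
        self.char_cell = nn.LSTMCell(char_hidden + word_hidden, char_hidden)
        # Slow module: its state is updated only when a word boundary is observed.
        self.word_cell = nn.LSTMCell(char_hidden, word_hidden)
        self.out = nn.Linear(char_hidden, vocab_size)
        self.char_hidden = char_hidden
        self.word_hidden = word_hidden

    def forward(self, chars, boundary):
        # chars:    (batch, seq_len) character indices
        # boundary: (batch, seq_len) float tensor, 1.0 where the character is a word delimiter
        batch, seq_len = chars.shape
        device = chars.device
        hc = torch.zeros(batch, self.char_hidden, device=device)
        cc = torch.zeros(batch, self.char_hidden, device=device)
        hw = torch.zeros(batch, self.word_hidden, device=device)
        cw = torch.zeros(batch, self.word_hidden, device=device)
        emb = self.embed(chars)
        logits = []
        for t in range(seq_len):
            # Character-level module ticks every step, conditioned on the slow word-level summary.
            hc, cc = self.char_cell(torch.cat([emb[:, t], hw], dim=1), (hc, cc))
            # Word-level module ticks only at boundaries (masked state update).
            hw_new, cw_new = self.word_cell(hc, (hw, cw))
            mask = boundary[:, t].unsqueeze(1)
            hw = mask * hw_new + (1.0 - mask) * hw
            cw = mask * cw_new + (1.0 - mask) * cw
            logits.append(self.out(hc))
        # Output stays on the character clock: (batch, seq_len, vocab_size)
        return torch.stack(logits, dim=1)

Because the masked update keeps both the input and output layers on the character clock, a standard next-character cross-entropy training loop can be used without modification, which is the property the abstract emphasizes.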
