Strategies for training large scale neural network language models

We describe how to effectively train neural network based language models on large data sets. Faster convergence during training and better overall performance are observed when the training data are sorted by their relevance. We introduce a hash-based implementation of a maximum entropy model that can be trained as part of the neural network model, which leads to a significant reduction in computational complexity. We achieved around a 10% relative reduction in word error rate on the English Broadcast News speech recognition task, compared to a large 4-gram model trained on 400M tokens.
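The following is a minimal sketch, not the authors' implementation, of the hash-based maximum entropy idea described above: n-gram history features are hashed into a fixed-size weight table, and their summed weights are added as direct connections to the output layer alongside the neural network's own scores. All names and constants here (HASH_SIZE, hash_ngram, the toy vocabulary, the learning rate) are illustrative assumptions.

```python
import numpy as np

HASH_SIZE = 2 ** 20   # fixed number of hash buckets (assumed value)
VOCAB = ["<s>", "the", "cat", "sat", "on", "mat", "</s>"]
V = len(VOCAB)
word_id = {w: i for i, w in enumerate(VOCAB)}

# One weight per (hashed n-gram history, output word) pair, stored in a
# flat table indexed modulo HASH_SIZE to bound memory; collisions are
# accepted as part of the hashing trick.
maxent_weights = np.zeros(HASH_SIZE)

def hash_ngram(history, target, order):
    """Hash an n-gram feature: the last `order` history word ids plus the target id."""
    key = tuple(history[-order:]) + (target,)
    return hash(key) % HASH_SIZE

def maxent_logits(history, max_order=3):
    """Sum the hashed feature weights for every candidate next word."""
    logits = np.zeros(V)
    for target in range(V):
        for order in range(1, max_order + 1):
            if len(history) >= order:
                logits[target] += maxent_weights[hash_ngram(history, target, order)]
    return logits

def predict(history, nn_logits, max_order=3):
    """Combine neural-network logits with the hashed max-ent logits via softmax."""
    logits = nn_logits + maxent_logits(history, max_order)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def sgd_step(history, target, nn_logits, lr=0.1, max_order=3):
    """One stochastic gradient step on the max-ent weights for an observed word."""
    probs = predict(history, nn_logits, max_order)
    for w in range(V):
        grad = (1.0 if w == target else 0.0) - probs[w]
        for order in range(1, max_order + 1):
            if len(history) >= order:
                maxent_weights[hash_ngram(history, w, order)] += lr * grad

# Toy usage: train the max-ent weights on one sentence, with a dummy
# (all-zero) neural component standing in for the network's logits.
sentence = ["<s>", "the", "cat", "sat", "on", "the", "mat", "</s>"]
ids = [word_id[w] for w in sentence]
for _ in range(20):
    for t in range(1, len(ids)):
        sgd_step(ids[:t], ids[t], nn_logits=np.zeros(V))
print(predict([word_id["<s>"], word_id["the"]], np.zeros(V)).round(3))
```

The point of the fixed-size hash table is that memory no longer grows with the number of distinct n-grams, so higher-order features can be trained jointly with the neural model at modest cost.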
