Large Vocabulary SOUL Neural Network Language Models

This paper presents a continuation of research on Structured OUtput Layer Neural Network language models (SOUL NNLMs) for automatic speech recognition. Because SOUL NNLMs can estimate probabilities for all in-vocabulary words, not only for those in a limited shortlist, we investigate their performance on a large-vocabulary task. Significant improvements in both perplexity and word error rate over conventional shortlist-based NNLMs are shown on a challenging Arabic GALE task with a recognition vocabulary of about 300k entries. A new training scheme is proposed for SOUL NNLMs, based on separate training of the out-of-shortlist part of the output layer. It allows more data to be used at each training iteration without any considerable slow-down and brings additional improvements in speech recognition performance.
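To make the underlying factorization concrete, below is a minimal numpy sketch of a SOUL-style structured output layer. It assumes, for illustration only, a two-level hierarchy in which the top-level softmax covers the shortlist words plus one unit per out-of-shortlist (OOS) word class, and each OOS class has its own softmax over its member words; the class names, dimensions, and `SOULOutputLayer` interface are all hypothetical. The actual SOUL model uses a deeper clustering tree over the vocabulary, but the core idea, P(w|h) = P(class|h) * P(w|class, h) for OOS words, is the same.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

class SOULOutputLayer:
    """Illustrative sketch of a SOUL-style structured output layer.

    Assumed simplification: a two-level hierarchy (shortlist words plus
    one unit per OOS class at the top, then per-class softmaxes), rather
    than the deeper clustering tree of the actual SOUL model.
    """

    def __init__(self, hidden_dim, shortlist, oos_classes, rng=None):
        rng = rng or np.random.default_rng(0)
        self.shortlist = {w: i for i, w in enumerate(shortlist)}
        self.oos_classes = oos_classes  # list of lists of OOS words
        self.word_to_class = {w: (c, j)
                              for c, words in enumerate(oos_classes)
                              for j, w in enumerate(words)}
        # Top-level softmax: one unit per shortlist word + one per OOS class.
        top_dim = len(shortlist) + len(oos_classes)
        self.W_top = rng.normal(0, 0.01, (top_dim, hidden_dim))
        # One small softmax per OOS class over that class's words.
        self.W_class = [rng.normal(0, 0.01, (len(ws), hidden_dim))
                        for ws in oos_classes]

    def prob(self, word, h):
        """P(word | history), where h is the hidden-layer activation."""
        top = softmax(self.W_top @ h)
        if word in self.shortlist:
            # Shortlist word: probability comes directly from the top softmax.
            return top[self.shortlist[word]]
        # OOS word: P(class | h) * P(word | class, h).
        c, j = self.word_to_class[word]
        class_unit = len(self.shortlist) + c
        within = softmax(self.W_class[c] @ h)
        return top[class_unit] * within[j]

# Toy usage: 3 shortlist words, 2 hypothetical OOS classes of 2 words each.
layer = SOULOutputLayer(
    hidden_dim=8,
    shortlist=["the", "of", "and"],
    oos_classes=[["aardvark", "abacus"], ["zygote", "zephyr"]],
)
h = np.random.default_rng(1).normal(size=8)
print(layer.prob("the", h), layer.prob("zygote", h))
```

Because the top-level softmax shares its mass between shortlist words and class units, the probabilities sum to one over the full vocabulary, which is what lets a SOUL model score every in-vocabulary word rather than backing off for words outside the shortlist. The separate-training scheme proposed in the paper exploits this split: the out-of-shortlist part (the per-class softmaxes above) can be trained on more data without retraining the whole network at each iteration.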
