Structured Output Layer Neural Network Language Models for Speech Recognition

This paper extends a novel neural network language model (NNLM) that relies on word clustering to structure the output vocabulary: the Structured OUtput Layer (SOUL) NNLM. This model can handle arbitrarily large vocabularies, dispensing with the shortlists commonly used in NNLMs. In this model, several softmax layers replace the standard output layer, and the output structure depends on a word clustering derived from the continuous word representations learned by the NNLM itself. Mandarin and Arabic data are used to evaluate SOUL NNLM accuracy via speech-to-text experiments, with well-tuned speech-to-text systems (error rates around 10%) serving as baselines. The SOUL model achieves consistent improvements over a classical shortlist NNLM, both in perplexity and in recognition accuracy, for these two languages, which differ substantially in internal structure and recognition vocabulary size. An enhanced training scheme is also proposed that allows more data to be used at each training iteration of the neural network.
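The key idea behind a class-structured output layer is to factor the word probability as P(word | h) = P(class | h) · P(word | class, h), so that each softmax is computed over a small set instead of the full vocabulary. The sketch below illustrates this two-level factorization with NumPy; it is a minimal illustration under assumed sizes (`H`, `N_CLASSES`, `WORDS_PER_CLASS` are hypothetical), not the actual SOUL architecture, which uses a deeper clustering tree learned from the NNLM's own word embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration: hidden layer size,
# number of word classes, and words per class.
H, N_CLASSES, WORDS_PER_CLASS = 8, 4, 5

# Two sets of softmax parameters: one over classes, and one per class
# over the words that class contains (a two-level structured output).
W_class = rng.normal(size=(N_CLASSES, H))
W_word = rng.normal(size=(N_CLASSES, WORDS_PER_CLASS, H))

def softmax(z):
    # Numerically stable softmax.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def word_probability(h, c, w):
    """P(word) = P(class | h) * P(word | class, h).

    Only two small softmaxes are evaluated, instead of one softmax
    over the full vocabulary of N_CLASSES * WORDS_PER_CLASS words.
    """
    p_class = softmax(W_class @ h)[c]
    p_word_given_class = softmax(W_word[c] @ h)[w]
    return p_class * p_word_given_class

# Sanity check: the factored probabilities over the whole vocabulary
# still form a proper distribution (sum to 1).
h = rng.normal(size=H)
total = sum(word_probability(h, c, w)
            for c in range(N_CLASSES)
            for w in range(WORDS_PER_CLASS))
print(round(total, 6))  # 1.0
```

Because each softmax normalizes over only a handful of outputs, the per-word cost scales with the class sizes rather than the vocabulary size, which is what makes arbitrarily large vocabularies tractable.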
