Improving Automatic Speech Recognition through the Improvement of Language Models

Language models are one of the pillars on which the performance of automatic speech recognition systems is based. Statistical language models that use word-sequence probabilities (n-grams) are the most common, although deep neural networks are now also being applied in this area, made possible by increases in computing power and improvements in algorithms. In this paper, the impact that language models have on recognition results is addressed in two situations: 1) when they are adjusted to the working environment of the final application, and 2) when their complexity grows, either through increases in the order of the n-gram models or through the application of deep neural networks. Specifically, an automatic speech recognition system with different language models is applied to audio recordings corresponding to three experimental frameworks: formal orality, speech in newscasts, and TED talks in Galician. Experimental results show that improving the quality of the language models yields improvements in recognition performance.
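To make the notion of word-sequence probabilities concrete, the sketch below estimates a bigram language model and scores a test sentence by perplexity, the usual quality measure for the models discussed above. It is a minimal illustration only, not the paper's setup: the two-sentence Galician toy corpus, the add-one (Laplace) smoothing, and all variable names are assumptions introduced here, whereas the experiments in the paper rely on much larger corpora and standard toolkits.

```python
# Minimal sketch (illustrative only): a bigram language model with
# add-one smoothing, evaluated by perplexity on a test sentence.
from collections import Counter
import math

# Toy training corpus (hypothetical Galician sentences with boundary markers).
train = [
    "<s> o tempo mellora pola tarde </s>".split(),
    "<s> o goberno presenta os orzamentos </s>".split(),
]

unigrams = Counter(w for sent in train for w in sent)
bigrams = Counter((w1, w2) for sent in train for w1, w2 in zip(sent, sent[1:]))
vocab = set(unigrams)

def bigram_prob(w1, w2):
    # Add-one smoothing: P(w2 | w1) = (c(w1, w2) + 1) / (c(w1) + |V|)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

def perplexity(sentence):
    # Perplexity = 2 ** (-average log2 probability per predicted token).
    log_prob = sum(math.log2(bigram_prob(w1, w2))
                   for w1, w2 in zip(sentence, sentence[1:]))
    n_predicted = len(sentence) - 1
    return 2 ** (-log_prob / n_predicted)

test = "<s> o goberno mellora os orzamentos </s>".split()
print(f"Perplexity: {perplexity(test):.2f}")
```

Raising the n-gram order or replacing the count-based estimate with a neural network changes only how the conditional probabilities are computed; the evaluation by perplexity, and its link to recognition performance, stays the same.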
