The GTM-UVIGO System for Albayzin 2018 Speech-to-Text Evaluation

This paper describes the Speech-to-Text system developed by the Multimedia Technologies Group (GTM) of the atlanTTic research center at the University of Vigo for the Albayzin Speech-to-Text Challenge (S2T), organized as part of the IberSPEECH 2018 conference. The large-vocabulary automatic speech recognition system is built with the Kaldi toolkit. It uses a hybrid Deep Neural Network - Hidden Markov Model (DNN-HMM) for acoustic modeling. A first decoding stage produces trigram-based word lattices, which are then rescored with either a four-gram language model or a language model based on a recurrent neural network. The system was evaluated only on the open-set training condition.
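The two-pass idea can be illustrated with a minimal sketch. It is not the system's Kaldi implementation: it approximates the word lattice with a short N-best list, and every name in it (`Hypothesis`, `rescore_nbest`, `toy_second_pass_lm`), as well as all scores, is hypothetical and chosen only for illustration. Each hypothesis keeps its acoustic score from the first pass, while its language model score is replaced by that of a stronger second-pass model before the best hypothesis is selected again.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    words: List[str]
    acoustic_logprob: float  # score from the acoustic model (first pass)
    lm_logprob: float        # score from the language model currently applied


def total_score(hyp: Hypothesis, lm_weight: float = 1.0) -> float:
    """Combine acoustic and language model log-scores."""
    return hyp.acoustic_logprob + lm_weight * hyp.lm_logprob


def rescore_nbest(
    hyps: Sequence[Hypothesis],
    new_lm: Callable[[List[str]], float],
    lm_weight: float = 1.0,
) -> Hypothesis:
    """Replace each first-pass LM score with the second-pass LM score
    and return the hypothesis with the best combined score."""
    rescored = [
        Hypothesis(h.words, h.acoustic_logprob, new_lm(h.words)) for h in hyps
    ]
    return max(rescored, key=lambda h: total_score(h, lm_weight))


def toy_second_pass_lm(words: List[str]) -> float:
    """Hypothetical stand-in for a four-gram or RNN language model:
    it simply prefers sentences ending in 'come'."""
    return -4.0 if words[-1] == "come" else -6.0


# Toy N-best list with made-up first-pass (trigram) scores.
nbest = [
    Hypothesis(["el", "gato", "come"], acoustic_logprob=-10.0, lm_logprob=-5.0),
    Hypothesis(["el", "gato", "comen"], acoustic_logprob=-9.5, lm_logprob=-5.2),
]

first_pass_best = max(nbest, key=total_score)
second_pass_best = rescore_nbest(nbest, toy_second_pass_lm)
print("first pass: ", " ".join(first_pass_best.words))
print("second pass:", " ".join(second_pass_best.words))
```

In this toy case the second-pass model changes the winning hypothesis. Rescoring a full lattice follows the same principle but operates on the lattice arcs, so that many more alternatives are considered than an N-best list can hold.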
