QCRI Advanced Transcription System (QATS) for the Arabic Multi-Dialect Broadcast Media Recognition: MGB-2 Challenge

In this paper, we describe Qatar Computing Research Institute's (QCRI) speech transcription system for the 2016 Dialectal Arabic Multi-Genre Broadcast (MGB-2) challenge. MGB-2 is a controlled evaluation using 1,200 hours of audio with lightly supervised transcriptions. Our system, a combination of three purely sequence-trained recognition systems, achieved the lowest WER of 14.2% among the nine participating teams. The key features of our transcription system are: purely sequence-trained acoustic models using the recently introduced lattice-free Maximum Mutual Information (LF-MMI) modeling framework; language model rescoring using four-gram and Recurrent Neural Network with MaxEnt connections (RNNME) language models; and system combination using the Minimum Bayes Risk (MBR) decoding criterion. The whole system is built with the Kaldi speech recognition toolkit.
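For reference, the two criteria named above have standard formulations, sketched here from general knowledge rather than reproduced from the paper. The MMI objective maximized during LF-MMI training is

\mathcal{F}_{\mathrm{MMI}} = \sum_{u} \log \frac{p(\mathbf{O}_u \mid \mathcal{M}_{W_u})^{\kappa}\, P(W_u)}{\sum_{W} p(\mathbf{O}_u \mid \mathcal{M}_{W})^{\kappa}\, P(W)},

where \mathbf{O}_u is the acoustic observation sequence of utterance u, W_u its reference transcription, \mathcal{M}_W the HMM for word sequence W, and \kappa an acoustic scale; in the lattice-free variant the denominator sum is computed over a phone-level graph rather than word lattices. MBR decoding selects the hypothesis with the lowest expected word-level risk,

\hat{W} = \arg\min_{W'} \sum_{W} P(W \mid \mathbf{O})\, L(W, W'),

with L(\cdot,\cdot) the Levenshtein (word edit) distance; the same criterion can be applied over the pooled hypotheses of multiple systems for combination.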
