LIUM ASR systems for the 2016 Multi-Genre Broadcast Arabic challenge

This paper describes the automatic speech recognition (ASR) systems developed by LIUM for the 2016 Multi-Genre Broadcast (MGB-2) Challenge in Arabic. LIUM participated in the first of the two proposed tasks, namely the speech-to-text transcription of Aljazeera recordings. We present the approaches and implementation details of our systems, as well as our results in the evaluation campaign, in which the primary LIUM ASR system ranked second. The main contributions are the use of GMM-derived features for training a DNN, combined with time-delay neural networks as acoustic models; the use of two different approaches to automatically phonetize Arabic words; and the training data selection strategy for acoustic and language models.
