DARTS: Dialectal Arabic Transcription System

We present DARTS, a speech-to-text transcription system for the low-resource Egyptian Arabic dialect. We analyze two approaches: transfer learning from the high-resource broadcast domain to the low-resource dialectal domain, and semi-supervised learning using in-domain unlabeled audio collected from YouTube. Key features of our system are: a deep neural network acoustic model consisting of a front-end Convolutional Neural Network (CNN) followed by several layers of Time Delay Neural Network (TDNN) and Long Short-Term Memory (LSTM) recurrent layers; sequence-discriminative training of the acoustic model; and n-gram and recurrent neural network language models for decoding and N-best list rescoring. We show that a simple transfer learning method achieves good results, which are further improved by using unlabeled YouTube data in a semi-supervised setup. The individual systems are combined into a final system that achieves the lowest word error rate on the community-standard Egyptian Arabic speech dataset (MGB-3).
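To make the acoustic-model layout concrete, here is a minimal PyTorch sketch of a CNN front end followed by TDNN layers (implemented, as is common, as dilated 1-D convolutions over time) and an LSTM stack. This is not the authors' implementation; all layer counts, channel widths, and the `num_targets` output size are illustrative assumptions.

```python
# Illustrative CNN -> TDNN -> LSTM acoustic model (a sketch, not the paper's code).
import torch
import torch.nn as nn

class CnnTdnnLstm(nn.Module):
    def __init__(self, num_feats=40, num_targets=3000):  # sizes are assumptions
        super().__init__()
        # Front-end CNN over the (time x frequency) feature map.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # TDNN layers as dilated 1-D convolutions over the time axis.
        self.tdnn = nn.Sequential(
            nn.Conv1d(32 * num_feats, 512, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3, padding=3),
            nn.ReLU(),
        )
        # LSTM layers for long-range temporal context.
        self.lstm = nn.LSTM(512, 512, num_layers=2, batch_first=True)
        self.output = nn.Linear(512, num_targets)  # e.g. senone posteriors

    def forward(self, feats):
        # feats: (batch, time, num_feats), e.g. MFCC or filterbank frames.
        x = self.cnn(feats.unsqueeze(1))                  # (B, C, T, F)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)    # (B, T, C*F)
        x = self.tdnn(x.transpose(1, 2)).transpose(1, 2)  # (B, T, 512)
        x, _ = self.lstm(x)
        return self.output(x)                             # (B, T, num_targets)
```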

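The abstract also mentions N-best list rescoring with a recurrent neural network language model. The sketch below shows the general shape of such a pass, assuming a first-pass decoder has emitted hypotheses with acoustic and n-gram LM scores; `rnnlm_logprob` is a hypothetical scoring function, and the scale and interpolation weights are illustrative, not the paper's.

```python
# Minimal N-best rescoring sketch under the assumptions stated above.
def rescore_nbest(nbest, rnnlm_logprob, lm_scale=0.7, interp=0.5):
    """nbest: list of (words, acoustic_logprob, ngram_logprob) tuples."""
    best, best_score = None, float("-inf")
    for words, am, ngram in nbest:
        # Log-linear interpolation of n-gram and RNN LM scores.
        lm = interp * ngram + (1.0 - interp) * rnnlm_logprob(words)
        score = am + lm_scale * lm
        if score > best_score:
            best, best_score = words, score
    return best
```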