Development of a Vietnamese Large Vocabulary Continuous Speech Recognition System under Noisy Conditions

In this paper, we first present our effort to collect a 500-hour corpus of Vietnamese read speech. We then apply several techniques to build the speech recognition system: data augmentation, recurrent neural network language model rescoring, language model adaptation, bottleneck features, and system combination. Our final system achieves a word error rate of 6.9% on the noisy test set.
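For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in this work. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are obtained by splitting on whitespace; a real
    scoring pipeline may normalize text differently before scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a hypothesis that substitutes one word and drops another out of a four-word reference scores 2/4 = 0.5, and the value can exceed 1.0 when the hypothesis contains many insertions.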
