Gated Module Neural Network for Multilingual Speech Recognition

For most multilingual large vocabulary continuous speech recognition (LVCSR) systems, allowing multiple languages at the same time degrades performance significantly because of strong inter-language competition in the decoding phase. To increase inter-language discrimination capacity, this paper presents a gated module neural network (GMN) approach that adapts a language identification (LID) component to directly assist the final multilingual LVCSR goal. Thanks to an international collaboration, three large-scale speech corpora (Mandarin, English, and Slovak, denoted Zh, En, and Sk) were shared for studying this problem, so the proposed approach was evaluated on both bilingual (Zh/En and Sk/En) and trilingual (Zh/En/Sk) LVCSR tasks. The experimental results show that the proposed GMN is promising and that the performance of the multilingual LVCSRs is now more comparable with that of the monolingual ones.
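The abstract describes a gating idea in which an LID component assists the multilingual recognizer. As a minimal sketch of one common formulation of such gating, the snippet below uses LID posteriors as soft gates over the outputs of language-specific modules; the function names, the per-language linear modules, and the weighted-sum combination rule are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: turns raw LID scores into posteriors.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def gated_module_forward(frame, lid_logits, modules):
    """Combine language-specific module outputs, weighted by LID posteriors.

    frame      : acoustic feature vector for one frame
    lid_logits : raw LID scores, one per language (Zh, En, Sk)
    modules    : per-language transforms (hypothetical stand-ins for the
                 paper's language-specific sub-networks)
    """
    gates = softmax(lid_logits)                  # LID posteriors act as gates
    outputs = np.stack([m(frame) for m in modules])
    return gates @ outputs                       # gate-weighted sum of outputs

# Toy usage with three "languages" and random linear modules.
rng = np.random.default_rng(0)
frame = rng.standard_normal(4)
modules = [lambda x, W=rng.standard_normal((4, 4)): W @ x for _ in range(3)]
lid_logits = np.array([3.0, 0.5, 0.2])           # LID strongly favours language 0
out = gated_module_forward(frame, lid_logits, modules)
```

The intuition is that a confident LID score suppresses the contribution of the competing languages' modules, which is one way to reduce the inter-language competition the abstract identifies in decoding.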
