End-to-End Articulatory Attribute Modeling for Low-Resource Multilingual Speech Recognition

The end-to-end (E2E) approach allows automatic speech recognition (ASR) systems to be trained without hand-designed, language-specific pronunciation lexicons. However, building multilingual low-resource E2E ASR systems remains challenging because of the vast number of output symbols (e.g., words and characters). In this paper, we investigate an efficient method of encoding multilingual transcriptions for training E2E ASR systems. We directly encode the symbols of multilingual writing systems as universal articulatory representations, which are much more compact than characters or words. In contrast to traditional multilingual modeling methods, we build a single acoustic-articulatory model within the recent transformer-based E2E framework for ASR tasks. The speech recognition results of the proposed method significantly outperform those of conventional word-based and character-based E2E models.
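
To make the core idea concrete, the following Python sketch illustrates how transcription symbols could be encoded as multi-hot articulatory-attribute vectors rather than as entries in a large character or word vocabulary. The attribute inventory and phone-to-attribute table here are purely illustrative assumptions, not the paper's actual inventory or mapping; they only show why the representation is compact (a small, fixed set of universal attributes shared across languages).

```python
# Minimal, hypothetical sketch: encode transcription symbols as
# multi-hot articulatory-attribute vectors. The inventory and the
# phone-to-attribute table below are illustrative assumptions only.

# Small, language-universal attribute inventory (manner, place, voicing, ...).
ATTRIBUTES = [
    "vowel", "stop", "fricative", "nasal",   # manner
    "bilabial", "alveolar", "velar",         # place
    "voiced", "voiceless",                   # voicing
    "high", "low", "front", "back",          # vowel features
]
ATTR_INDEX = {a: i for i, a in enumerate(ATTRIBUTES)}

# Hypothetical mapping for a tiny subset of phones.
PHONE_TO_ATTRS = {
    "p": ["stop", "bilabial", "voiceless"],
    "b": ["stop", "bilabial", "voiced"],
    "t": ["stop", "alveolar", "voiceless"],
    "s": ["fricative", "alveolar", "voiceless"],
    "m": ["nasal", "bilabial", "voiced"],
    "i": ["vowel", "high", "front", "voiced"],
    "a": ["vowel", "low", "voiced"],
}

def phones_to_attribute_vectors(phones):
    """Map a phone sequence to multi-hot articulatory-attribute vectors."""
    vectors = []
    for ph in phones:
        vec = [0] * len(ATTRIBUTES)
        for attr in PHONE_TO_ATTRS[ph]:
            vec[ATTR_INDEX[attr]] = 1
        vectors.append(vec)
    return vectors

if __name__ == "__main__":
    # Example target sequence for the phone string of "map".
    for ph, vec in zip("map", phones_to_attribute_vectors(["m", "a", "p"])):
        print(ph, vec)
```

Because every language's symbols reduce to the same small attribute set, the E2E model's output layer stays fixed in size regardless of how many languages or scripts are covered, which is what makes this encoding attractive in low-resource multilingual settings.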
