Paper Information - Cross-Lingual Transfer Learning of Non-Native Acoustic Modeling for Pronunciation Error Detection and Diagnosis


In computer-assisted pronunciation training (CAPT), the scarcity of large-scale non-native corpora and of human expert annotations are two fundamental challenges for non-native acoustic modeling. Most existing approaches to acoustic modeling in CAPT rely on non-native corpora, yet with so many living languages in the world it is impractical to collect and annotate a non-native speech corpus for every language pair. In this work, we address non-native acoustic modeling at both the phonetic and the articulatory level using transfer learning. To train acoustic models of non-native speech effectively without using non-native data, we propose to exploit two large native speech corpora, one of the learner's native language (L1) and one of the target language (L2), to model cross-lingual phenomena. This kind of transfer learning can provide a better feature representation of non-native speech. Experimental evaluations are carried out for Japanese speakers learning English. We first demonstrate that the proposed acoustic-phone model achieves a lower word error rate in non-native speech recognition. It also improves pronunciation error detection based on the goodness of pronunciation (GOP) score. For the diagnosis of pronunciation errors, the proposed acoustic-articulatory modeling method is effective in providing detailed feedback at the articulation level.
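
The abstract does not say which GOP variant is used, so the following is only an illustrative sketch of one common posterior-based formulation: for a phone segment, average over its frames the log ratio between the posterior of the canonical (expected) phone and the largest posterior in that frame. The function names `gop_score` and `is_mispronounced`, the toy posteriors, and the threshold value are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def gop_score(posteriors: Sequence[Sequence[float]], canonical: int,
              eps: float = 1e-10) -> float:
    """Average log posterior ratio of the canonical phone over a segment.

    posteriors: one row per frame, one column per phone (each row sums to ~1).
    canonical:  column index of the phone the learner was supposed to say.
    Returns a value <= 0; values near 0 mean the canonical phone was also
    the most likely phone in most frames.
    """
    if not posteriors:
        raise ValueError("segment must contain at least one frame")
    total = 0.0
    for frame in posteriors:
        best = max(frame)
        # eps guards against log(0) for phones with zero posterior.
        total += math.log(max(frame[canonical], eps)) - math.log(max(best, eps))
    return total / len(posteriors)


def is_mispronounced(score: float, threshold: float = -0.5) -> bool:
    """Flag a segment whose GOP falls below a threshold (hypothetical value)."""
    return score < threshold


# Toy segment: two frames, three phones (made-up numbers).
frames = [[0.7, 0.2, 0.1],
          [0.6, 0.3, 0.1]]
score_expected_phone_0 = gop_score(frames, canonical=0)
score_expected_phone_1 = gop_score(frames, canonical=1)
```

In this toy segment, phone 0 dominates both frames, so a segment whose canonical phone is 0 scores at the maximum and is accepted, while one whose canonical phone is 1 scores lower and is flagged under the assumed threshold. In practice the threshold would be tuned, often per phone, on held-out data.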
