Cross-Lingual Transfer Learning of Non-Native Acoustic Modeling for Pronunciation Error Detection and Diagnosis

In computer-assisted pronunciation training (CAPT), the scarcity of large-scale non-native corpora and of human expert annotations are two fundamental challenges for non-native acoustic modeling. Most existing approaches to acoustic modeling in CAPT rely on non-native corpora, yet given the number of living languages in the world, it is impractical to collect and annotate a non-native speech corpus for every language pair. In this work, we address non-native acoustic modeling at both the phonetic and articulatory levels using transfer learning. To train acoustic models of non-native speech effectively without using non-native data, we propose to exploit two large native speech corpora, one of the learner's native language (L1) and one of the target language (L2), to model cross-lingual phenomena. This kind of transfer learning can provide a better feature representation of non-native speech. Experimental evaluations are carried out for Japanese speakers learning English. We first demonstrate that the proposed acoustic-phone model achieves a lower word error rate in non-native speech recognition. It also improves pronunciation error detection based on the goodness of pronunciation (GOP) score. For diagnosis of pronunciation errors, the proposed acoustic-articulatory modeling method is effective in providing detailed feedback at the articulation level.
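
For readers unfamiliar with GOP-based error detection, the following is a minimal sketch of one common posterior-based formulation, not necessarily the exact variant used in this work. It assumes frame-level log posteriors over phones are already available for a segment aligned to its canonical phone; the function names `gop_score` and `detect_error` and the threshold value are hypothetical and chosen only for illustration.

```python
import numpy as np

def gop_score(log_post: np.ndarray, canonical: int) -> float:
    """Posterior-based GOP for one aligned phone segment (assumed formulation).

    log_post: array of shape (frames, phones) holding log posteriors.
    canonical: index of the phone the learner was expected to produce.
    Returns the average log posterior of the canonical phone minus that of
    the best-scoring phone, so the score is 0 when the canonical phone wins
    and negative otherwise.
    """
    avg = log_post.mean(axis=0)  # average log posterior per phone
    return float(avg[canonical] - avg.max())

def detect_error(log_post: np.ndarray, canonical: int,
                 threshold: float = -1.0) -> bool:
    """Flag a segment as mispronounced when its GOP falls below a threshold.

    The threshold here is an arbitrary illustrative value; in practice it
    would be tuned, possibly per phone, on held-out data.
    """
    return gop_score(log_post, canonical) < threshold

# Toy segment: two frames, three phones (values are made up).
segment = np.log(np.array([[0.7, 0.2, 0.1],
                           [0.6, 0.3, 0.1]]))
well_pronounced = detect_error(segment, canonical=0)
mispronounced = detect_error(segment, canonical=2)
```

In this toy segment, phone 0 dominates both frames, so a learner whose canonical phone is 0 is accepted, while one whose canonical phone is 2 is flagged. A better acoustic model, such as one obtained through the transfer learning described above, would be expected to yield more reliable posteriors and therefore more reliable GOP decisions.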
