Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks

This paper investigates the use of multidistribution deep neural networks (DNNs) for mispronunciation detection and diagnosis (MDD), to circumvent the difficulties encountered in an existing approach based on extended recognition networks (ERNs). ERNs leverage existing automatic speech recognition technology by constraining the search space to the canonical transcriptions of the target words plus their likely phonetic error patterns. MDD is then performed by comparing the recognized transcriptions with the canonical ones. Although this approach performs reasonably well, it has two issues: 1) learning the error patterns of the target words to generate the ERNs remains a challenging task, and phones or phone errors missing from the ERNs cannot be recognized even with well-trained acoustic models; and 2) acoustic models and phonological rules are trained independently, and hence contextual information is lost. To address these issues, we propose an acoustic-graphemic-phonemic model (AGPM) using a multidistribution DNN, whose input features include acoustic features as well as the corresponding graphemes and canonical transcriptions (encoded as binary vectors). The AGPM can implicitly model both grapheme-to-likely-pronunciation and phoneme-to-likely-pronunciation conversions, which are integrated into acoustic modeling. With the AGPM, we develop a unified MDD framework that works much like free-phone recognition. Experiments show that our method achieves a phone error rate (PER) of 11.1%. The false rejection rate (FRR), false acceptance rate (FAR), and diagnostic error rate (DER) for MDD are 4.6%, 30.5%, and 13.5%, respectively. The method outperforms the ERN approach using DNNs as acoustic models, whose PER, FRR, FAR, and DER are 16.8%, 11.0%, 43.6%, and 32.3%, respectively.
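
The abstract reports FRR, FAR, and DER without defining them. The following is a minimal sketch of how such rates might be computed by comparing recognized phones against canonical and human-annotated ones. It assumes the commonly used definitions FRR = FR / (TA + FR), FAR = FA / (FA + TR), and DER = DE / (CD + DE), and it assumes the three phone sequences are already aligned position by position (insertions and deletions are not handled). The function names `mdd_counts` and `mdd_rates` and the example phone labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def mdd_counts(canonical, annotated, recognized):
    """Count MDD outcomes over three position-aligned phone sequences.

    canonical:  phones the learner was supposed to say
    annotated:  phones a human transcriber heard (the reference)
    recognized: phones the recognizer produced
    Assumption: all three are pre-aligned and of equal length.
    """
    if not (len(canonical) == len(annotated) == len(recognized)):
        raise ValueError("sequences must be aligned to the same length")

    counts = Counter()
    for canon, human, machine in zip(canonical, annotated, recognized):
        pronounced_correctly = human == canon
        accepted = machine == canon
        if pronounced_correctly and accepted:
            counts["TA"] += 1  # true acceptance
        elif pronounced_correctly:
            counts["FR"] += 1  # false rejection
        elif accepted:
            counts["FA"] += 1  # false acceptance
        elif machine == human:
            counts["CD"] += 1  # true rejection, correct diagnosis
        else:
            counts["DE"] += 1  # true rejection, diagnostic error
    return counts


def mdd_rates(counts):
    """Turn outcome counts into FRR, FAR and DER (assumed definitions)."""
    ta, fr, fa = counts["TA"], counts["FR"], counts["FA"]
    cd, de = counts["CD"], counts["DE"]
    tr = cd + de  # all true rejections

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "FRR": ratio(fr, ta + fr),
        "FAR": ratio(fa, fa + tr),
        "DER": ratio(de, tr),
    }


if __name__ == "__main__":
    canonical = ["n", "ao", "r", "th"]
    annotated = ["n", "ao", "l", "s"]
    recognized = ["n", "aa", "l", "th"]
    print(mdd_rates(mdd_counts(canonical, annotated, recognized)))
```

Under these assumed definitions, FRR is measured over correctly pronounced phones, FAR over mispronounced phones, and DER only over mispronunciations that were detected, so the three rates have different denominators and are not directly additive.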
