A Two-Pass Framework of Mispronunciation Detection and Diagnosis for Computer-Aided Pronunciation Training

This paper presents a two-pass framework with discriminative acoustic modeling for mispronunciation detection and diagnosis (MD&D). The first pass, mispronunciation detection, does not require explicit modeling of phonetic error patterns. The framework instantiates a set of antiphones and a filler model to augment the original phone model for each canonical phone. This guarantees full coverage of all possible error patterns while maximally exploiting the phonetic information derived from the text prompt. The antiphones detect substitutions, the filler model detects insertions, and phone skips are allowed in order to detect deletions. As such, no prior assumption is made about the error patterns that can occur. The second pass, mispronunciation diagnosis, expands the detected insertions and substitutions into phone networks, and another recognition pass attempts to reveal the phonetic identities of the detected errors. Discriminative training (DT) is applied separately to the acoustic models of the detection pass and the diagnosis pass. DT effectively separates the acoustic models of the canonical phones and the antiphones. Overall, with DT in both passes of MD&D, the error rate is reduced by 40.4% relative compared with the maximum likelihood baseline. After DT, the error rates of the two passes are also lower than those of a strong single-pass baseline with DT by 1.3% and 5.1% relative, respectively; both differences are statistically significant.
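
As an illustration of the detection pass only, the following is a minimal sketch of how a canonical phone sequence might be expanded into a network that covers substitutions, insertions, and deletions without listing specific error patterns. It is not the authors' implementation: the labels `<filler>` and `<skip>`, the `anti_` naming convention, the use of a single antiphone per phone (the paper uses a set), and the placement of filler slots between phones are all assumptions made for the example. No acoustic scoring or decoding is shown; a decoded path is simply given as input.

```python
from typing import List, Tuple

FILLER = "<filler>"  # assumed label for the filler model (insertions)
SKIP = "<skip>"      # assumed label for an empty arc (nothing spoken here)


def anti(phone: str) -> str:
    """Assumed naming convention for the antiphone of a canonical phone."""
    return f"anti_{phone}"


def build_detection_network(canonical: List[str]) -> List[List[str]]:
    """Expand a canonical phone sequence into slots of alternatives.

    Even-indexed slots are optional filler slots (possible insertions).
    Odd-indexed slots hold the canonical phone, its antiphone
    (substitution), or a skip (deletion).
    """
    network = [[SKIP, FILLER]]  # filler slot before the first phone
    for phone in canonical:
        network.append([phone, anti(phone), SKIP])
        network.append([SKIP, FILLER])  # filler slot after this phone
    return network


def label_path(canonical: List[str], path: List[str]) -> List[Tuple[str, object]]:
    """Turn a decoded path (one choice per slot) into detected error labels."""
    network = build_detection_network(canonical)
    if len(path) != len(network):
        raise ValueError("path must contain one choice per network slot")
    detections: List[Tuple[str, object]] = []
    for i, (choice, slot) in enumerate(zip(path, network)):
        if choice not in slot:
            raise ValueError(f"{choice!r} is not allowed in slot {i}")
        if i % 2 == 0:
            # Filler slot: i // 2 canonical phones precede this position.
            if choice == FILLER:
                detections.append(("insertion", i // 2))
        else:
            phone = canonical[i // 2]
            if choice == SKIP:
                detections.append(("deletion", phone))
            elif choice == anti(phone):
                detections.append(("substitution", phone))
    return detections


canonical = ["k", "ae", "t"]
decoded = [SKIP, "k", FILLER, "anti_ae", SKIP, SKIP, SKIP]
detections = label_path(canonical, decoded)
```

In this sketch, `detections` marks only where an error occurred and of what type. Identifying which phone was actually produced for a detected insertion or substitution is the job of the second (diagnosis) pass, which is not modeled here.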
