Vowel mispronunciation detection using DNN acoustic models with cross-lingual training

We address the automatic detection of phone-level mispronunciations for feedback in a computer-aided language learning task where target language data (Indian English) is limited. Motivated by the recent success of DNN acoustic models on limited-resource recognition tasks, we compare different methods of using the limited target language data to train acoustic models that are initialized with multilingual data. Frame-level DNN posteriors obtained with the different training methods are compared against a baseline GMM/HMM system in a phone classification task. Domain knowledge of L2 phonology and L1 interference, including its influence on phone quality and duration, is applied to the design of confidence scores for detecting mispronounced vowels of Indian English as spoken by Gujarati L1 learners. We also show that the pronunciation error detection system benefits from a more precise signal-based segmentation of the test speech vowels, as expected given that the frame-based confidence scores become more reliable with better segment boundaries.
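To make the idea of a frame-based confidence score concrete, the following is a minimal sketch, not the scoring used in this work. It assumes a generic score: the mean log posterior of the canonical (expected) phone over the frames of a vowel segment, compared against a threshold. The function names, the toy posterior values, the phone labels, and the threshold are all hypothetical and chosen only for illustration.

```python
import math

def segment_confidence(posteriors, target, start, end, floor=1e-10):
    """Mean log posterior of the target phone over frames [start, end).

    posteriors: list of dicts mapping phone label -> posterior probability,
                one dict per frame (hypothetical stand-in for DNN outputs).
    """
    if not 0 <= start < end <= len(posteriors):
        raise ValueError("segment must be a non-empty range within the utterance")
    total = 0.0
    for frame in posteriors[start:end]:
        # Floor the probability so that log() never receives zero.
        total += math.log(max(frame.get(target, 0.0), floor))
    return total / (end - start)

def is_mispronounced(score, threshold):
    """Flag the segment when its confidence falls below the threshold."""
    return score < threshold

# Toy frame posteriors (assumed values): the first two frames belong to the
# vowel "iy", the last two to a neighbouring segment dominated by "ih".
frames = [
    {"iy": 0.7, "ih": 0.2, "eh": 0.1},
    {"iy": 0.6, "ih": 0.3, "eh": 0.1},
    {"iy": 0.2, "ih": 0.7, "eh": 0.1},
    {"iy": 0.1, "ih": 0.8, "eh": 0.1},
]

precise = segment_confidence(frames, "iy", 0, 2)  # tight vowel boundaries
loose = segment_confidence(frames, "iy", 0, 4)    # boundaries spill over

THRESHOLD = -1.0  # hypothetical decision threshold
print(is_mispronounced(precise, THRESHOLD), is_mispronounced(loose, THRESHOLD))
```

In this toy case the loosely segmented vowel scores lower than the precisely segmented one because frames from the neighbouring segment are averaged in, which illustrates why a score of this kind depends on accurate segment boundaries.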
