Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition

A recent trend in language learning is gamification, i.e., the application of game-design elements and game principles in non-game contexts. A key component of such applications is the detection of mispronunciations by means of automatic speech recognition. However, the constraints of conventional systems, such as the need for quiet environments and close-talking microphones, hinder their applicability in language learning games.