Paper information - Toward Robust Mispronunciation Detection via Audio-Visual Speech Recognition

A recent trend in language learning is gamification, i.e., the application of game-design elements and game principles in non-game contexts. A key component of such games is the detection of mispronunciations by means of automatic speech recognition. However, constraints such as the need for quiet environments and close-talking microphones limit the applicability of automatic speech recognition in language learning games.

Dorothea Kolossa | Steffen Zeiler | Mahdie Karbasi | Jan Freiwald
