Paper information - Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams


In this paper, we explore the use of deep belief network (DBN) posteriorgrams as input to our previously proposed comparison-based system for detecting word-level mispronunciation. The system works by aligning a nonnative utterance with at least one native utterance and extracting features that describe the degree of misalignment from the aligned path and the distance matrix. We report system performance under different DBN training scenarios: pre-training and fine-tuning with either native data only or both native and nonnative data. Experimental results show that replacing the system input, either MFCCs or Gaussian posteriorgrams obtained in a fully unsupervised manner, with DBN posteriorgrams improves system performance by at least 10.4% relative. Moreover, system performance remains steady when only 30% of the annotations are used.
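
The alignment step described in the abstract can be illustrated with a minimal sketch. It assumes each utterance is already a posteriorgram (a list of per-frame posterior probability vectors) and uses the negative log of the inner product as the frame distance, which is an assumption here and not something the abstract specifies. The function names and the two summary features (mean cost along the path and maximum deviation from the diagonal) are hypothetical illustrations, not the paper's actual feature set.

```python
import math


def frame_distance(p, q, eps=1e-12):
    """Assumed frame distance: -log of the inner product of two posterior vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return -math.log(max(dot, eps))


def distance_matrix(nonnative, native):
    """Pairwise frame distances between two posteriorgrams."""
    return [[frame_distance(x, y) for y in native] for x in nonnative]


def dtw(dist):
    """Classic DTW with steps (1,0), (0,1), (1,1); returns total cost and path."""
    n, m = len(dist), len(dist[0])
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = dist[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = min(
                acc[i - 1][j] if i > 0 else inf,
                acc[i][j - 1] if j > 0 else inf,
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            acc[i][j] = dist[i][j] + best

    # Backtrack from the end to recover the aligned path.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n - 1][m - 1], path


def misalignment_features(dist, path):
    """Two hypothetical features computed from the path and the distance matrix."""
    n, m = len(dist), len(dist[0])
    mean_cost = sum(dist[i][j] for i, j in path) / len(path)
    max_dev = max(abs(i / max(n - 1, 1) - j / max(m - 1, 1)) for i, j in path)
    return {"mean_path_cost": mean_cost, "max_diagonal_deviation": max_dev}


# Toy posteriorgrams over two phone classes (made-up values).
native = [[0.9, 0.1], [0.1, 0.9]]
nonnative = [[0.9, 0.1], [0.1, 0.9]]
dist = distance_matrix(nonnative, native)
total_cost, path = dtw(dist)
features = misalignment_features(dist, path)
```

In a comparison-based detector of this kind, features such as these would be passed to a classifier that labels each word as mispronounced or not; a path that stays close to the diagonal with low cost suggests the nonnative word resembles the native reference.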

James R. Glass | Yaodong Zhang | Ann Lee
