Mispronunciation detection via dynamic time warping on deep belief network-based posteriorgrams

In this paper, we explore the use of deep belief network (DBN) posteriorgrams as input to our previously proposed comparison-based system for detecting word-level mispronunciation. The system works by aligning a nonnative utterance with at least one native utterance and extracting features that describe the degree of misalignment from the aligned path and the distance matrix. We report system performance under different DBN training scenarios: pre-training and fine-tuning with either native data only or both native and nonnative data. Experimental results show that replacing the system input, previously MFCCs or Gaussian posteriorgrams obtained in a fully unsupervised manner, with DBN posteriorgrams improves system performance by at least 10.4% relative. Moreover, system performance remains steady when only 30% of the annotations are used.
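To make the comparison step concrete, the following is a minimal sketch of aligning two posteriorgrams with dynamic time warping and deriving a few misalignment descriptors from the aligned path and the distance matrix. It is an illustration under stated assumptions, not the system described above: the negative-log inner-product distance, the unconstrained step pattern, and the three features returned by the hypothetical `misalignment_features` helper are choices made here for the example, and the abstract does not specify any of them.

```python
import numpy as np


def posteriorgram_distance(a, b, eps=1e-10):
    """Pairwise distance matrix between two posteriorgrams.

    a: (n, k) array, b: (m, k) array; each row is a posterior over k classes.
    Assumption: distance is -log of the inner product, clipped for stability.
    """
    return -np.log(np.clip(a @ b.T, eps, None))


def dtw(dist):
    """Standard DTW over a precomputed (n, m) distance matrix.

    Returns the accumulated cost and the aligned path as (i, j) index pairs.
    """
    n, m = dist.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev

    # Backtrack from the end of both utterances to their first frames.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n, m], path


def misalignment_features(dist, path):
    """Hypothetical descriptors of how far the alignment is from ideal."""
    n, m = dist.shape
    costs = np.array([dist[i, j] for i, j in path])
    # Deviation of each path point from the diagonal of the normalized grid.
    deviation = np.array(
        [abs(i / max(n - 1, 1) - j / max(m - 1, 1)) for i, j in path]
    )
    return {
        "mean_path_cost": float(costs.mean()),
        "max_diag_deviation": float(deviation.max()),
        "path_len_ratio": len(path) / max(n, m),
    }
```

Given a nonnative posteriorgram `x` and a native posteriorgram `y`, one would compute `dist = posteriorgram_distance(x, y)`, then `cost, path = dtw(dist)`, and pass `dist` and `path` to `misalignment_features`. In a setup like the one described, such descriptors would feed a downstream classifier; that stage is not sketched here.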
