Context-Dependent Duration Modeling with Backoff Strategy and Look-Up Tables for Pronunciation Assessment and Mispronunciation Detection

This paper presents an in-depth study of contextual modeling methods for duration information, with the aim of improving pronunciation assessment and mispronunciation detection. The main ideas are: 1) we extend the relations within a duration sequence to different levels of contextual constraint and bring them into a unified framework; 2) we introduce a backoff mechanism to address data sparseness and unbalanced distributions; 3) rather than traditional parametric functions, we model the empirical duration distributions discretely with look-up tables, which improves model precision and speeds up computation. The experimental results show the effectiveness of these methods. Compared with a conventional context-independent baseline, the proposed word-dependent duration models yield an absolute improvement of 0.0782 in CC (correlation coefficient) for pronunciation assessment and an absolute reduction of 4.58% in EER (equal error rate) for mispronunciation detection.
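To make the combination of look-up tables and backoff concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract gives no details, so everything here is an assumption made for illustration: the class name `BackoffDurationTable`, the two-level hierarchy (word-dependent phone backing off to context-independent phone), the count threshold `min_count`, the cap `max_frames`, and the add-one smoothing.

```python
import math
from collections import defaultdict


class BackoffDurationTable:
    """Discrete (histogram) duration model with a simple count-based backoff.

    Hypothetical sketch: contexts are (word, phone) pairs, and the
    context-independent level is stored under (None, phone).
    """

    def __init__(self, max_frames=100, min_count=5):
        self.max_frames = max_frames  # durations above this share the last bin
        self.min_count = min_count    # assumed threshold for trusting a context
        # counts[context][duration_bin] -> number of training observations
        self.counts = defaultdict(lambda: [0] * (max_frames + 1))
        self.totals = defaultdict(int)

    def _bin(self, frames):
        # Clamp a duration in frames to a valid table index.
        return min(max(frames, 0), self.max_frames)

    def add(self, word, phone, frames):
        # Update both the word-dependent and the context-independent table.
        b = self._bin(frames)
        for ctx in ((word, phone), (None, phone)):
            self.counts[ctx][b] += 1
            self.totals[ctx] += 1

    def context_for(self, word, phone):
        # Use the word-dependent table only if it has enough data;
        # otherwise back off to the context-independent table.
        ctx = (word, phone)
        if self.totals.get(ctx, 0) >= self.min_count:
            return ctx
        return (None, phone)

    def log_prob(self, word, phone, frames):
        # Table look-up with add-one smoothing over the duration bins.
        ctx = self.context_for(word, phone)
        table = self.counts.get(ctx)
        count = table[self._bin(frames)] if table is not None else 0
        total = self.totals.get(ctx, 0)
        return math.log((count + 1) / (total + self.max_frames + 1))


model = BackoffDurationTable()
for _ in range(5):
    model.add("cat", "ae", 8)   # enough data for a word-dependent table
model.add("hat", "ae", 12)      # too sparse, so queries back off
print(model.context_for("cat", "ae"), model.context_for("hat", "ae"))
```

Scoring a phone then costs one dictionary access and one list index, which is the kind of saving over evaluating a parametric density that a table-based model offers. A real system would likely use more context levels and a weighted rather than hard backoff.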
