Pronunciation assessment based on multilayer multiple regression analysis using structural features

With rapid internationalization and the spread of information technology, much research effort has gone into building computer-aided language learning (CALL) systems. A good pronunciation assessment system should be built on technologies that can cope with the acoustic variability in learners’ utterances caused by non-linguistic factors such as age and gender. However, the widely used acoustic modeling technique, the hidden Markov model (HMM), often performs unstably across speakers of different ages and genders. Recently, a new method was proposed that represents learners’ pronunciations with their non-linguistic features effectively removed, called the pronunciation structure. In this method, only the contrastive features of speech are extracted. However, the excessively high dimensionality of the structure degrades its performance. To solve this problem, this paper proposes multilayer regression analysis using structural features. The results show a much higher correlation between human and machine assessments of learners’ pronunciations than the previously proposed structure-based method achieves. Further, the proposed method is much more robust than the widely used HMM-based method. We also propose an effective combination of the structure and the HMM.
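To make the idea of contrastive features concrete, the following is a minimal sketch and not the paper's implementation. It assumes each phoneme is modeled as a one-dimensional Gaussian and that the contrast between two phonemes is measured with the Bhattacharyya distance; both choices, along with the names `bhattacharyya_1d`, `structure_vector`, `affine`, and the toy values, are illustrative assumptions. The sketch shows two properties: the pairwise distances do not change when every distribution undergoes the same affine transform (a crude stand-in for a speaker difference), and the number of distances grows as N(N-1)/2 with the number of phonemes N, which is where the high dimensionality comes from.

```python
import math
from itertools import combinations


def bhattacharyya_1d(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two 1-D Gaussians (mean, variance)."""
    var_sum = var1 + var2
    mean_term = 0.25 * (mu1 - mu2) ** 2 / var_sum
    var_term = 0.5 * math.log(var_sum / (2.0 * math.sqrt(var1 * var2)))
    return mean_term + var_term


def structure_vector(phonemes):
    """Flatten all pairwise distances between phoneme distributions into a list."""
    return [
        bhattacharyya_1d(*phonemes[p], *phonemes[q])
        for p, q in combinations(sorted(phonemes), 2)
    ]


def affine(phonemes, a, b):
    """Apply x -> a*x + b to every distribution (assumed speaker-like distortion)."""
    return {k: (a * mu + b, a * a * var) for k, (mu, var) in phonemes.items()}


# Hypothetical toy data: phoneme label -> (mean, variance).
speaker = {"a": (1.0, 0.5), "i": (-0.5, 0.3), "u": (0.2, 0.8), "e": (2.0, 0.4)}
shifted = affine(speaker, a=1.7, b=-3.0)

s1 = structure_vector(speaker)
s2 = structure_vector(shifted)

# The distances agree up to floating-point error despite the transform.
invariant = all(
    math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12) for x, y in zip(s1, s2)
)

n = len(speaker)
print(len(s1), n * (n - 1) // 2, invariant)
```

With 4 phonemes the vector has 6 entries; with a few dozen phonemes it would have several hundred, so a regression model trained on the full vector has many parameters relative to the typical amount of scored learner data.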
