Sub-structure-based estimation of pronunciation proficiency and classification of learners

Automatic estimation of pronunciation proficiency faces a specific difficulty. Adequacy in controlling the vocal organs can be estimated from the spectral envelopes of input utterances, but the envelope patterns are also easily affected by speaker differences. To develop a pedagogically sound method for automatic estimation, the envelope changes caused by linguistic factors and those caused by extra-linguistic factors should be properly separated. To this end, in our previous study [1], we proposed a mathematically guaranteed and linguistically valid speaker-invariant representation of pronunciation, called speech structure. Since that proposal, we have also examined the representation for ASR [2], [3], [4], and through these works we have learned better how to apply speech structures to various tasks. In this paper, we revisit a proficiency estimation experiment conducted in [1] and, based on our recently proposed structural techniques, carry it out again under new and different conditions. Here, we use smaller units of structural analysis, speaker-invariant sub-structures, and relative structural distances between a learner and a teacher. Results show that correlations between human and machine ratings are improved, and that the proposed scores are far more robust to speaker differences than widely used GOP scores. Further, we demonstrate that the proposed representation can classify learners purely by their pronunciation proficiency, unaffected by their age and gender.
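The core idea behind speech structure is that a matrix of pairwise f-divergence distances among a speaker's sound distributions is invariant to speaker-dependent transformations, so a learner and a teacher can be compared through their distance matrices rather than their raw spectra. The sketch below is an illustrative assumption, not the paper's implementation: it models each phone as a 1-D Gaussian, uses the Bhattacharyya distance (one instance of an invariant f-divergence measure), and compares two structures by the Euclidean distance between their upper-triangle elements.

```python
import math

def bhattacharyya_1d(m1, v1, m2, v2):
    """Bhattacharyya distance between two 1-D Gaussians N(m, v).

    Invariant under any common affine transform x -> a*x + b applied
    to both distributions, which is what makes the structure below
    insensitive to (affinely modeled) speaker differences.
    """
    return ((m1 - m2) ** 2 / (4.0 * (v1 + v2))
            + 0.5 * math.log((v1 + v2) / (2.0 * math.sqrt(v1 * v2))))

def structure(phones):
    """Full pairwise distance matrix over a list of (mean, var) phone models."""
    n = len(phones)
    return [[bhattacharyya_1d(*phones[i], *phones[j]) for j in range(n)]
            for i in range(n)]

def structural_distance(s_learner, s_teacher):
    """Euclidean distance between the upper-triangle elements of two structures."""
    n = len(s_learner)
    return math.sqrt(sum((s_learner[i][j] - s_teacher[i][j]) ** 2
                         for i in range(n) for j in range(i + 1, n)))

# Toy teacher model and a "learner" who differs only by a speaker-dependent
# affine warp of the feature axis: the structures coincide.
teacher = [(0.0, 1.0), (2.0, 1.5), (5.0, 0.8)]
warped = [(1.3 * m + 0.7, 1.3 ** 2 * v) for m, v in teacher]
print(structural_distance(structure(teacher), structure(warped)))  # ~0.0
```

In the paper's setting, the distributions would come from HMM states trained on learner and teacher speech, sub-structures would restrict the matrix to subsets of sounds, and the learner–teacher structural distances would feed the proficiency estimator; all of those specifics are beyond this sketch.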

[1]  Keikichi Hirose, et al.  Structural analysis of dialects, sub-dialects and sub-sub-dialects of Chinese, 2009, INTERSPEECH.

[2]  Nobuaki Minematsu, et al.  F-divergence is a generalized invariant measure between distributions, 2008, INTERSPEECH.

[3]  Keikichi Hirose, et al.  Multi-stream parameterization for structural speech recognition, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[4]  Keikichi Hirose, et al.  Implementation of robust speech recognition by simulating infants' speech perception based on the invariant sound shape embedded in utterances, 2009.

[5]  Koichi Shinoda, et al.  Rapid vocal tract length normalization using maximum likelihood estimation, 2001, INTERSPEECH.

[6]  Linda R. Waugh, et al.  The Sound Shape of Language, 1979.

[7]  Nobuaki Minematsu, et al.  Development of English speech database read by Japanese to support CALL research, 2004.

[8]  Hermann Ney, et al.  Vocal tract normalization equals linear transformation in cepstral space, 2001, IEEE Transactions on Speech and Audio Processing.

[9]  Martin J. Russell, et al.  Challenges for computer recognition of children's speech, 2007, SLaTE.

[10]  Keikichi Hirose, et al.  Structural representation of the pronunciation and its use for CALL, 2006, 2006 IEEE Spoken Language Technology Workshop.

[11]  Steve J. Young, et al.  Phone-level pronunciation scoring and assessment for interactive language learning, 2000, Speech Communication.

[12]  Nobuaki Minematsu, et al.  Random discriminant structure analysis for automatic recognition of connected vowels, 2007, 2007 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU).

[13]  Nobuaki Minematsu  Are learners myna birds to the averaged distributions of native speakers? - a note of warning from a serious speech engineer -, 2007, SLaTE.

[14]  W. Labov, et al.  The Atlas of North American English, 2005.

[15]  Nobuaki Minematsu  Pronunciation assessment based upon the phonological distortions observed in language learners' utterances, 2004, INTERSPEECH.

[16]  Keikichi Hirose, et al.  Directional dependency of cepstrum on vocal tract length, 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[17]  Keikichi Hirose, et al.  Structural representation of pronunciation and its application for classifying Japanese learners of English, 2007, SLaTE.