Integration of multilayer regression analysis with structure-based pronunciation assessment

Abstract

Automatic pronunciation assessment has several difficulties. Adequacy in controlling the vocal organs is often estimated from the spectral envelopes of input utterances, but the envelope patterns are also affected by other factors such as speaker identity. Recently, a new method of speech representation was proposed where these non-linguistic variations are effectively removed through modeling only the contrastive aspects of speech features. This speech representation is called speech structure. However, the often excessively high dimensionality of the speech structure can degrade the performance of structure-based pronunciation assessment. To deal with this problem, we integrate multilayer regression analysis with the structure-based assessment. The results show higher correlation between human and machine scores and also show much higher robustness to speaker differences compared to widely used GOP-based analysis.

Index Terms: CALL, speech structure, regression, GOP

1. Introduction

Automatic pronunciation assessment is a task used to evaluate only the linguistic aspect of utterances. However, speech features inevitably include acoustic variations caused by non-linguistic factors such as the speaker, communication channel and noise. The same pronunciation can lead to different acoustic observations due to different speakers and different environments. To deal with these variations, modern pronunciation assessment approaches mainly make use of statistical methods to model the distributions of the acoustic features [1]. These methods can achieve relatively high performance when there is a good match between training and testing conditions. But their performance always degrades significantly when these conditions are mismatched. In Automatic Speech Recognition (ASR), speaker adaptation techniques have proved effective at reducing mismatches. However, if the acoustic models used in pronunciation assessment are adapted to learners, incorrect pronunciations might be recognized as correct due to over-adaptation [2].

To solve the mismatch problem, the third author of this paper proposed a new speech representation, called speech structure, which aims at removing the non-linguistic factors in speech features [3]. In contrast to classical speech models, speech structures make use of

[1] Steve J. Young et al. Phone-level pronunciation scoring and assessment for interactive language learning, 2000, Speech Commun.

[2] Nobuaki Minematsu et al. English Speech Database Read by Japanese Learners for CALL System Development, 2002, LREC.

[3] Hideki Kawahara et al. STRAIGHT, exploitation of the other aspect of VOCODER: Perceptually isomorphic decomposition of speech sounds, 2006.

[4] Hermann Ney et al. Vocal tract normalization equals linear transformation in cepstral space, 2001, IEEE Transactions on Speech and Audio Processing.

[5] Keikichi Hirose et al. Analysis and utilization of MLLR speaker adaptation technique for learners' pronunciation evaluation, 2009, INTERSPEECH.

[6] Susan R. Hertz. A model of the regularities underlying speaker variation: evidence from hybrid synthesis, 2006, INTERSPEECH.

[7] Masato Akagi et al. Speaker individualities in speech spectral envelopes, 1994, ICSLP.

[8] Nobuaki Minematsu. Yet another acoustic representation of speech sounds, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9] Keikichi Hirose et al. Structural analysis of dialects, sub-dialects and sub-sub-dialects of Chinese, 2009, INTERSPEECH.

[10] Nobuaki Minematsu et al. F-divergence Is a Generalized Invariant Measure between Distributions, 2008, INTERSPEECH.

[11] Nobuaki Minematsu et al. A Study on Invariance of $f$-Divergence and Its Application to Speech Recognition, 2010, IEEE Transactions on Signal Processing.

[12] Keikichi Hirose et al. Optimal event search using a structural cost function - improvement of structure to speech conversion, 2009, INTERSPEECH.

[13] Keikichi Hirose et al. Sub-structure-based estimation of pronunciation proficiency and classification of learners, 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[14] Keikichi Hirose et al. Structural assessment of language learners' pronunciation, 2007, INTERSPEECH.