Acoustic and pronunciation model adaptation for context-independent and context-dependent pronunciation variability of non-native speech

In this paper, we propose an acoustic and pronunciation model adaptation method for context-independent (CI) and context-dependent (CD) pronunciation variability to improve the performance of a non-native automatic speech recognition (ASR) system. The proposed adaptation method is performed in three steps. First, we perform phone recognition to obtain an n-best list of phoneme sequences and derive pronunciation variant rules by using a decision tree. Second, the pronunciation variant rules are decomposed into CI and CD pronunciation variation on the basis of context dependency. That is, the pronunciation variant rules that are dedicated to specific phoneme sequences are classified as CI pronunciation variation, and the others are classified as CD pronunciation variation. Here we assume that CI pronunciation variability arises because the pronunciation space of the target language differs from that of the non-native speaker's mother tongue, whereas CD pronunciation variability arises from coarticulation effects in context. Third, acoustic model adaptation is performed in the state-tying step for the CI pronunciation variability obtained from an indirect data-driven method. In addition, pronunciation model adaptation is completed by constructing a multiple pronunciation dictionary using the CD pronunciation variability. Continuous Korean-English ASR experiments show that the proposed method reduces the average word error rate (WER) by 16.02% compared with a baseline ASR system trained on native speech. Moreover, an ASR system using the proposed method provides average WER reductions of 8.95% and 3.67% compared with acoustic model adaptation alone and pronunciation model adaptation alone, respectively.
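
As an illustration of the pronunciation model adaptation step only, the following minimal sketch shows one way a multiple pronunciation dictionary could be built from CD pronunciation variant rules. It is not the authors' implementation: the rule format (`CDRule` with a left context, target phone, right context, and variant phone), the function names, the `#` word-boundary symbol, and the example phones are all assumptions made for this sketch. Contexts are matched against the canonical phone sequence, and the canonical pronunciation is always kept as the first entry.

```python
from itertools import product
from typing import Dict, List, NamedTuple


class CDRule(NamedTuple):
    """Hypothetical context-dependent rule: left target right -> variant."""
    left: str     # phone required to the left ("#" marks a word boundary)
    target: str   # canonical phone that may be replaced
    right: str    # phone required to the right ("#" marks a word boundary)
    variant: str  # phone realized in place of the target


def expand_pronunciation(phones: List[str], rules: List[CDRule]) -> List[List[str]]:
    """Return every pronunciation licensed by the rules, canonical one first."""
    padded = ["#"] + list(phones) + ["#"]
    options = []
    for i in range(1, len(padded) - 1):
        # The canonical phone is always the first choice at each position.
        choices = [padded[i]]
        for rule in rules:
            context = (padded[i - 1], padded[i], padded[i + 1])
            if (rule.left, rule.target, rule.right) == context:
                if rule.variant not in choices:
                    choices.append(rule.variant)
        options.append(choices)
    # Cartesian product over per-position choices enumerates all variants.
    return [list(p) for p in product(*options)]


def build_multi_pron_dict(
    lexicon: Dict[str, List[str]], rules: List[CDRule]
) -> Dict[str, List[List[str]]]:
    """Map each word to its list of pronunciation variants."""
    return {word: expand_pronunciation(phones, rules) for word, phones in lexicon.items()}


# Illustrative data only; the phone symbols and the rule are not from the paper.
lexicon = {"think": ["th", "ih", "ng", "k"]}
rules = [CDRule(left="#", target="th", right="ih", variant="s")]
multi_dict = build_multi_pron_dict(lexicon, rules)
```

In this sketch the number of variants grows multiplicatively with the number of positions where a rule fires, so a practical system would likely prune variants, for example by rule probability. That pruning is not shown here.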
