A Hybrid Acoustic and Pronunciation Model Adaptation Approach for Non-native Speech Recognition

In this paper, we propose a hybrid model adaptation approach in which pronunciation and acoustic models are adapted by incorporating the pronunciation and acoustic variabilities of non-native speech in order to improve the performance of non-native automatic speech recognition (ASR). Specifically, the proposed hybrid model adaptation can be performed at either the state-tying or the triphone-modeling level, depending on the level at which acoustic model adaptation is carried out. In both methods, we first analyze the pronunciation variant rules of non-native speakers and then classify each rule as either a pronunciation variant or an acoustic variant. The state-tying level hybrid method then adapts the pronunciation model by accommodating the pronunciation variants in the pronunciation dictionary, and adapts the acoustic models by clustering the states of the triphone acoustic models using the acoustic variants. The triphone-modeling level hybrid method, on the other hand, adapts the pronunciation model in the same way; for the acoustic model adaptation, however, the triphone acoustic models are re-estimated based on the adapted pronunciation model, and the states of the re-estimated triphone models are then clustered using the acoustic variants. Speech recognition experiments on English spoken by Korean speakers show that ASR systems employing the state-tying and triphone-modeling level adaptation methods reduce the average word error rate (WER) on non-native speech by a relative 17.1% and 22.1%, respectively, compared to a baseline ASR system.
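
The pronunciation-model adaptation step shared by both hybrid methods can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical illustration rather than the authors' implementation: variant rules from the rule analysis are assumed to be pre-classified as pronunciation or acoustic variants, and only the pronunciation variants are used to add alternative entries to the recognition dictionary, while acoustic variants are left to the later state-clustering stage. The rule format, phone symbols, and example lexicon are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of the pronunciation-model
# adaptation step: variant rules derived from non-native speech are split into
# "pronunciation" and "acoustic" variants, and only the pronunciation variants
# are used to add alternative entries to the pronunciation dictionary.
# Rule format, phone symbols, and the example lexicon are assumptions.

from dataclasses import dataclass

@dataclass
class VariantRule:
    source: str   # canonical phone in the native pronunciation
    target: str   # phone typically realized by non-native speakers
    kind: str     # "pronunciation" or "acoustic", set by prior rule analysis

def apply_pronunciation_variants(lexicon, rules):
    """Return a lexicon augmented with variant pronunciations.

    Only rules classified as pronunciation variants generate new dictionary
    entries; acoustic-variant rules are handled at the state-tying /
    triphone level and are ignored here.
    """
    pron_rules = [r for r in rules if r.kind == "pronunciation"]
    adapted = {word: list(prons) for word, prons in lexicon.items()}
    for word, prons in lexicon.items():
        for pron in prons:
            for rule in pron_rules:
                if rule.source in pron:
                    variant = [rule.target if p == rule.source else p
                               for p in pron]
                    if variant not in adapted[word]:
                        adapted[word].append(variant)
    return adapted

if __name__ == "__main__":
    # Hypothetical example: /r/ realized as /l/ is treated as a pronunciation
    # variant and expands the dictionary, while a /z/ -> /jh/-like realization
    # is treated as an acoustic variant and left to acoustic-model adaptation.
    lexicon = {"read": [["r", "iy", "d"]], "zero": [["z", "ih", "r", "ow"]]}
    rules = [VariantRule("r", "l", "pronunciation"),
             VariantRule("z", "jh", "acoustic")]
    for word, prons in apply_pronunciation_variants(lexicon, rules).items():
        print(word, prons)
```

Restricting dictionary expansion to the pronunciation variants mirrors the split described above, in which lexical variability is absorbed by the pronunciation model and the remaining acoustic variability is handled through state clustering of the triphone models.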
