ONLINE UNSUPERVISED MULTILINGUAL ACOUSTIC MODEL ADAPTATION FOR NONNATIVE ASR

Automatic speech recognition (ASR) is currently one of the main research interests in computer science, and many ASR systems are available on the market. Yet the performance of speech and language recognition systems remains poor on nonnative speech. The challenge of nonnative speech recognition is to maximize the accuracy of a speech recognition system when only a small amount of nonnative data is available. Recent studies on nonnative speech recognition have focused on the supervised setting, in which the spoken language (L2) and the speaker's mother tongue (L1) are known in advance. In this paper, we study an adaptation approach for nonnative speech in which neither L1 nor L2 is known in advance. We call this new approach online unsupervised multilingual acoustic model adaptation: "unsupervised" means that the L1 and L2 of a nonnative speech utterance are not known in advance, and "online" means that the adaptation is performed during decoding. The proposed approach decomposes into two stages. The first stage, which contains a language observer module, recovers the linguistic information (the spoken language and the origin of the speaker) of the unknown speech utterance to be decoded. The second stage adapts the multilingual acoustic model based on the knowledge provided by the language observer module; the multilingual acoustic model must therefore contain the acoustic units of both L2 and L1. In this study, we report on acoustic model adaptation for improving the recognition of nonnative speech in English, French and Vietnamese, spoken by speakers of different origins. Experimental results show a degradation of only around 7% relative to the baseline systems' phone error rates (PERs), demonstrating the feasibility of the method.
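The two-stage flow described above can be sketched in a few lines of Python. This is a minimal illustrative sketch only: all names, profiles and weights below are hypothetical stand-ins, and the real system relies on multilingual HMM acoustic models and language identification, not on the toy classifier and dictionary filtering shown here.

```python
# Hypothetical sketch of online unsupervised multilingual adaptation:
# stage 1 hypothesizes (L1, L2) for an unknown utterance; stage 2 adapts
# the multilingual acoustic model to the hypothesized language pair.

def language_observer(utterance_features):
    """Stage 1 (toy): pick the (L1, L2) pair whose dummy language profile
    is closest to the utterance's feature vector."""
    profiles = {
        ("FR", "EN"): [0.2, 0.8],  # French speaker, English speech (dummy)
        ("VN", "FR"): [0.7, 0.3],  # Vietnamese speaker, French speech (dummy)
    }
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(profiles, key=lambda pair: dist(profiles[pair], utterance_features))

def adapt_acoustic_model(model, l1, l2):
    """Stage 2 (toy): keep only the acoustic units of the hypothesized
    L1 and L2 in the multilingual model before (re)decoding."""
    return {unit: w for unit, w in model.items() if unit.startswith((l1, l2))}

# Multilingual model holding acoustic units tagged by language (dummy weights).
multilingual_model = {"EN_ae": 1.0, "FR_oe": 1.0, "VN_ng": 1.0, "DE_ch": 1.0}

l1, l2 = language_observer([0.25, 0.75])                    # stage 1
adapted = adapt_acoustic_model(multilingual_model, l1, l2)  # stage 2
print(l1, l2, sorted(adapted))                              # FR EN ['EN_ae', 'FR_oe']
```

In the paper's setting, stage 1 would operate during decoding on the incoming utterance itself, and stage 2 would correspond to selecting or adapting the relevant multilingual acoustic units rather than simply filtering a dictionary.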
