A speaker adaptation method for non-native speech using learners' native utterances for computer-assisted language learning systems

In recent years, various computer-assisted language learning (CALL) systems that evaluate a learner's pronunciation using speech recognition technology have been proposed. Speaker adaptation is a promising way to evaluate a learner's utterances and point out pronunciation problems more accurately. However, many learners who use a CALL system speak the target language (L2) poorly, and conventional speaker adaptation methods require correctly pronounced L2 utterances from the learner as adaptation data. In this paper, we propose two new speaker adaptation methods for CALL systems that require only the learner's utterances in the native language (L1) to adapt the L2 acoustic model. The first method adapts the acoustic models by way of a bilingual speaker's utterances: the speaker-independent acoustic models of L1 and L2 are first adapted to the bilingual speaker and then adapted to the learner using the learner's L1 utterances. This method yielded about 5 points higher phoneme recognition accuracy than the baseline method. The second method is a training algorithm for a set of acoustic models based on speaker adaptive training. It robustly trains bilingual speakers' models from a few L1 and L2 utterances spoken by bilingual speakers. This method yielded about 10 points higher phoneme recognition accuracy than the baseline method.
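The two-stage flow of the first method can be illustrated with a toy sketch. This is not the paper's adaptation algorithm. It assumes each phoneme model is a single mean vector, that adaptation is one global mean offset shared by the L1 and L2 models, and that frame-level phoneme labels are available. The function names (`estimate_shift`, `apply_shift`, `two_stage_adapt`) and the `l1:`/`l2:` label prefixes are hypothetical, chosen for illustration only.

```python
import numpy as np


def estimate_shift(means, frames, labels):
    """Average offset between observed frames and the means of their labelled phonemes."""
    residuals = [np.asarray(f, dtype=float) - means[lab] for f, lab in zip(frames, labels)]
    return np.mean(residuals, axis=0)


def apply_shift(means, shift):
    """Return a copy of the model with every phoneme mean moved by the same offset."""
    return {phoneme: mu + shift for phoneme, mu in means.items()}


def two_stage_adapt(si_l1, si_l2, bi_frames, bi_labels, learner_frames, learner_labels):
    """Toy two-stage adaptation (assumption: one global mean offset per stage).

    Stage 1 moves the speaker-independent L1 and L2 models toward a bilingual
    speaker, using that speaker's L1 and L2 frames.
    Stage 2 estimates a second offset from the learner's L1 frames only and
    applies it to both the L1 and the L2 model.
    """
    # Stage 1: the bilingual speaker's data covers phonemes of both languages.
    si_all = {**si_l1, **si_l2}
    bi_shift = estimate_shift(si_all, bi_frames, bi_labels)
    bi_l1 = apply_shift(si_l1, bi_shift)
    bi_l2 = apply_shift(si_l2, bi_shift)

    # Stage 2: only L1 data from the learner is used, yet the L2 model moves too.
    learner_shift = estimate_shift(bi_l1, learner_frames, learner_labels)
    return apply_shift(bi_l1, learner_shift), apply_shift(bi_l2, learner_shift)


# Tiny synthetic example with 2-dimensional features (values are made up).
si_l1 = {"l1:a": np.array([0.0, 0.0])}
si_l2 = {"l2:ae": np.array([1.0, 1.0])}
bi_frames = [[0.5, 0.5], [1.5, 1.5]]
bi_labels = ["l1:a", "l2:ae"]
learner_frames = [[1.5, 0.5]]
learner_labels = ["l1:a"]

learner_l1, learner_l2 = two_stage_adapt(
    si_l1, si_l2, bi_frames, bi_labels, learner_frames, learner_labels
)
print(learner_l1["l1:a"], learner_l2["l2:ae"])
```

The point of the sketch is only the data flow: the learner contributes no L2 utterances, and the L2 model is still moved toward the learner because the offset estimated on L1 data is reused for L2. A real system would use a richer transform over HMM parameters than a single offset.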
