UTTERANCE-BASED SELECTIVE TRAINING FOR COST-EFFECTIVE TASK-ADAPTATION OF ACOUSTIC MODELS

The construction of acoustic models for speech recognition systems is a costly and time-consuming process, since robust training requires large amounts of transcribed speech data, which have to be collected and labeled by humans. This paper describes an approach for cost-effective construction of task-adapted acoustic models. Existing speech databases are used to set up a large training data pool. Apart from that, only a small amount of task-specific speech data is required. Based on an algorithm for utterance-based selective training of acoustic models, training utterances are selected from the training data pool so that the likelihood of the acoustic model given the task-specific speech data is maximized. The proposed method is evaluated for acoustic models with context-independent and context-dependent phonetic units. Results are reported for building an infant (preschool children) acoustic model with speech from elementary school children and an elderly acoustic model with adult speech. The proposed approach is already effective with as few as 20 task-specific utterances. A relative improvement in word accuracy of up to 10% is achieved over conventional acoustic model construction and up to 2.8% over MAP and MLLR adaptation with the task-specific data. The gap in performance to an acoustic model trained on large amounts of task-specific data was reduced by up to 76%.
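
The selection idea in the abstract can be illustrated with a toy sketch. It is not the paper's algorithm: it assumes a single one-dimensional Gaussian in place of an HMM acoustic model, per-utterance sufficient statistics (frame count, sum, sum of squares), and a greedy backward deletion that removes one pool utterance at a time while the log-likelihood of the task-specific frames keeps increasing. All function names (`stats`, `gaussian_from`, `log_likelihood`, `select_utterances`) and the variance floor are hypothetical choices made for illustration.

```python
import math


def stats(utt):
    """Sufficient statistics of one utterance: frame count, sum, sum of squares."""
    return (len(utt), sum(utt), sum(x * x for x in utt))


def gaussian_from(n, s, ss, var_floor=1e-3):
    """Maximum-likelihood mean and (floored) variance from accumulated statistics."""
    mean = s / n
    var = max(ss / n - mean * mean, var_floor)
    return mean, var


def log_likelihood(frames, mean, var):
    """Log-likelihood of the frames under a 1-D Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )


def select_utterances(pool, task_frames):
    """Greedy backward deletion (an assumption of this sketch).

    Starts from the full pool and repeatedly drops the utterance whose
    removal most increases the task-data log-likelihood; stops when no
    single removal helps. Assumes every pool utterance is non-empty.
    """
    pool_stats = [stats(u) for u in pool]
    selected = set(range(len(pool)))
    total = [sum(st[k] for st in pool_stats) for k in range(3)]
    best = log_likelihood(task_frames, *gaussian_from(*total))

    while len(selected) > 1:
        candidate = None
        for i in selected:
            # Model statistics without utterance i: subtract, no retraining pass.
            n, s, ss = (total[k] - pool_stats[i][k] for k in range(3))
            ll = log_likelihood(task_frames, *gaussian_from(n, s, ss))
            if ll > best + 1e-9:
                best, candidate = ll, i
        if candidate is None:
            break
        selected.remove(candidate)
        total = [total[k] - pool_stats[candidate][k] for k in range(3)]

    return sorted(selected), best


# Toy data: two pool utterances resemble the task data, two do not.
pool = [[0.0, 0.2, -0.1], [5.0, 5.2, 4.9], [0.1, -0.2, 0.05], [4.8, 5.1, 5.3]]
task_frames = [5.0, 5.1, 4.9]
selected, final_ll = select_utterances(pool, task_frames)
print(selected, round(final_ll, 3))
```

Because the model is re-estimated from summed statistics, evaluating the removal of an utterance only requires a subtraction rather than another pass over the pool. In this sketch that is what keeps utterance-level selection cheap; a real system would keep such statistics per HMM state and mixture component.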
