Cost Reduction of Acoustic Modeling for Real-Environment Applications Using Unsupervised and Selective Training

Development of an ASR application such as a speech-oriented guidance system for a real environment is expensive. Most of the costs are due to human labeling of newly collected speech data to construct the acoustic model for speech recognition. Employment of existing models or sharing models across multiple applications is often difficult, because the characteristics of speech depend on various factors such as possible users, their speaking style and the acoustic environment. Therefore, this paper proposes a combination of unsupervised learning and selective training to reduce the development costs. The employment of unsupervised learning alone is problematic due to the task-dependency of speech recognition and because automatic transcription of speech is error-prone. A theoretically well-defined approach to automatic selection of high quality and task-specific speech data from an unlabeled data pool is presented. Only those unlabeled data which increase the model likelihood given the labeled data are employed for unsupervised training. The effectivity of the proposed method is investigated with a simulation experiment to construct adult and child acoustic models for a speech-oriented guidance system. A completely human-labeled database which contains real-environment data collected over two years is available for the development simulation. It is shown experimentally that the employment of selective training alleviates the problems of unsupervised learning, i.e. it is possible to select speech utterances of a certain speaker group but discard noise inputs and utterances with lower recognition accuracy. The simulation experiment is carried out for several selected combinations of data collection and human transcription period. It is found empirically that the proposed method is especially effective if only relatively few of the collected data can be labeled and transcribed by humans.

[1]  Tomoki Toda,et al.  Utterance-Based Selective Training for the Automatic Creation of Task-Dependent Acoustic Models , 2006, IEICE Trans. Inf. Syst..

[2]  Dilek Z. Hakkani-Tür,et al.  Active and unsupervised learning for automatic speech recognition , 2003, INTERSPEECH.

[3]  Douglas A. Reynolds,et al.  Speaker identification and verification using Gaussian mixture speaker models , 1995, Speech Commun..

[4]  Masafumi Nishida,et al.  Speaker indexing and adaptation using speaker clustering based on statistical model selection , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Hermann Ney,et al.  Unsupervised training of acoustic models for large vocabulary continuous speech recognition , 2001, IEEE Workshop on Automatic Speech Recognition and Understanding, 2001. ASRU '01..

[6]  Liang Gu,et al.  Portability challenges in developing interactive dialogue systems , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[7]  Jean-Luc Gauvain,et al.  Lightly supervised and unsupervised acoustic model training , 2002, Comput. Speech Lang..

[8]  Gerard G. L. Meyer,et al.  Robustness aspects of active learning for acoustic modeling , 2004, INTERSPEECH.

[9]  Dilek Z. Hakkani-Tür,et al.  Active learning for automatic speech recognition , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[10]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[11]  K. Shikano,et al.  Selective EM training of acoustic models based on sufficient statistics of single utterances , 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005..

[12]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[13]  Jean-Luc Gauvain,et al.  Lightly supervised acoustic model training using consensus networks , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[14]  Philip C. Woodland,et al.  Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models , 1995, Comput. Speech Lang..

[15]  Kiyohiro Shikano,et al.  Noise robust real world spoken dialogue system using GMM based rejection of unintended inputs , 2004, INTERSPEECH.

[16]  Shuichi Itahashi,et al.  JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research , 1999 .

[17]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[18]  Ponani S. Gopalakrishnan,et al.  Clustering via the Bayesian information criterion with applications in speech recognition , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[19]  Ryuichi Nisimura,et al.  Takemaru-kun : Speech-Oriented Information System for Real World Research Platform , 2003 .

[20]  Kiyohiro Shikano,et al.  A new phonetic tied-mixture model for efficient decoding , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).