How to judge reusability of existing speech corpora for target task by utilizing statistical multidimensional scaling

Abstract

To develop a target speech recognition system at lower cost in time and money, the reusability of existing speech corpora is becoming one of the most important issues. This paper proposes a new technique to judge the reusability of existing speech corpora for a target task by utilizing a statistical multidimensional scaling method. In an experiment using twelve tasks from five speech corpora, the proposed method showed high correlation with cross-task recognition performance and correctly judged the reusability of existing speech corpora for the target task at lower cost.

Index Terms: statistical MDS, reusability, acoustic model, task dependency

1. Introduction

Recognition accuracy is still extremely sensitive to environmental conditions such as speaker characteristics, speaking style, background noise, and task domain. This sensitivity is called task dependency. Task dependency has a strong impact on the recognition performance of Automatic Speech Recognition (ASR) in embedded appliances such as car navigation systems, personal digital assistants, and robots. In these appliances, processing power and available memory are generally restricted for cost reasons, because not only the ASR but also other applications run on the same platform. In such cases, the number of parameters in the acoustic model must be reduced. As a result, the acoustic model cannot deliver sufficient performance even if it is trained on a huge speech corpus covering various tasks, so an acoustic model optimized for the target task is required.

In recent research [4], the task dependency and reusability of four speech corpora were investigated through a cross-task recognition experiment. Speech data with closer acoustic characteristics can be selected through a speech recognition experiment with a small amount of target-task speech data (development data).
However, no one can judge whether the selected speech data is sufficient and highly reusable for the target task. With a technique for judging reusability for the target task, we could decide where to invest in the ASR system: for instance, in collecting target speech data, acoustic modeling, language modeling, evaluation, or system maintenance.

This paper describes how to judge the reusability of existing speech corpora for a target task. In an experiment, 12 tasks contained in 5 Japanese speech corpora are evaluated with a statistical multidimensional scaling (MDS) method called the COSMOS (COmprehensive Space Map of Objective Signal) method [5], which visualizes aggregates of speech data in a two- or three-dimensional space. Visualization is recognized as an effective technique for grasping a multidimensional space that humans cannot easily understand. Understanding the relationship between target-task speech data and existing speech corpora by visualizing their acoustic space is expected to be effective for analyzing the reusability of existing speech corpora.

In the next section, our proposed method is described. In Section 3, an overview of the speech corpora is given. In Section 4, the effectiveness of the proposed method is investigated through a cross-task recognition experiment. Finally, a summary and an outlook on future work are given in Section 5.
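The idea of placing a target task and existing corpora on a low-dimensional map can be illustrated with a minimal sketch. The COSMOS algorithm is not specified in this section, so the sketch substitutes classical (Torgerson) MDS as a stand-in. The task names, the toy coordinates used to build the distance matrix, and the helper names `classical_mds` and `distances_to_target` are all hypothetical and are not taken from the paper.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Embed items in n_components dimensions from a symmetric distance matrix.

    Classical (Torgerson) MDS, used here only as a stand-in for COSMOS.
    """
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-centre the squared distances to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # Keep the eigenvectors belonging to the largest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


def distances_to_target(coords, target_index):
    """Euclidean distance on the map from every item to the target item."""
    return np.linalg.norm(coords - coords[target_index], axis=1)


# Hypothetical toy data: one target task and three existing corpora.
# In practice the distance matrix would come from distances between
# acoustic models; here it is built from made-up 2-D points.
tasks = ["target", "corpus_A", "corpus_B", "corpus_C"]
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

embedding = classical_mds(dist, n_components=2)

# Pairwise distances on the map, to compare against the input distances.
recovered = np.linalg.norm(
    embedding[:, None, :] - embedding[None, :, :], axis=-1
)

# Rank the items by their map distance to the target task.
to_target = distances_to_target(embedding, tasks.index("target"))
ranking = np.argsort(to_target)

for i in ranking:
    print(f"{tasks[i]}: {to_target[i]:.3f}")
```

In this sketch, corpora that lie closer to the target on the map are treated as candidates for reuse. Whether map distance actually tracks cross-task recognition performance is the question the paper's experiment addresses; the sketch does not show it.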

[1] Philip C. Woodland, et al. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, 1995, Comput. Speech Lang.

[2] Richard M. Schwartz, et al. A compact model for speaker-adaptive training, 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[3] Tetsuo Kosaka, et al. Tree-structured speaker clustering for speaker-independent continuous speech recognition, 1994, ICSLP.

[4] Anil K. Jain, et al. Statistical Pattern Recognition: A Review, 2000, IEEE Trans. Pattern Anal. Mach. Intell.

[5] Jean-Luc Gauvain, et al. Genericity and portability for task-independent speech recognition, 2005, Comput. Speech Lang.

[6] M. Shozakai, et al. Acoustic space analysis method utilizing statistical multidimensional scaling technique, 2005.

[7] Pavel Pudil, et al. Introduction to Statistical Pattern Recognition, 2006.

[8] Kiyohiro Shikano, et al. A speech enhancement approach E-CMN/CSS for speech recognition in car environments, 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[9] K. Shikano, et al. Selective EM training of acoustic models based on sufficient statistics of single utterances, 2005, IEEE Workshop on Automatic Speech Recognition and Understanding, 2005.

[10] Kohji Fukunaga, et al. Introduction to Statistical Pattern Recognition-Second Edition, 1990.

[11] Makoto Shozakai, et al. Analyzing reusability of speech corpus based on statistical multidimensional scaling method, 2006, INTERSPEECH.

[12] Makoto Shozakai, et al. Building an effective corpus by using acoustic space visualization (COSMOS) method [speech recognition applications], 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005.

[13] John W. Sammon, et al. A Nonlinear Mapping for Data Structure Analysis, 1969, IEEE Transactions on Computers.