Building an Effective Speech Corpus by Utilizing Statistical Multidimensional Scaling Method

This paper proposes a technique for building an effective speech corpus at lower cost by using a statistical multidimensional scaling method, which visualizes multiple HMM acoustic models in a two-dimensional space. First, a small number of voice samples is collected from each speaker, and speaker-adapted acoustic models trained on these utterances are mapped into the two-dimensional space with the statistical multidimensional scaling method. Next, speakers located on the periphery of the distribution in the plotted map are selected, and the speech corpus is built by collecting a sufficient number of voice samples from the selected speakers. In an experiment on building an isolated-word speech corpus, an acoustic model trained with 200 selected speakers performed on par with one trained with 533 non-selected speakers, corresponding to a cost reduction of more than 62%. In an experiment on building a continuous-speech corpus, an acoustic model trained with 500 selected speakers performed on par with one trained with 1179 non-selected speakers, corresponding to a cost reduction of more than 57%.
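The pipeline described above (train speaker-adapted models from a few utterances per speaker, project the models into two dimensions, then keep only peripheral speakers for full recording) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: each speaker model is reduced to a hypothetical mean supervector, Euclidean distance stands in for the paper's distance between Gaussian distributions, and classical (Torgerson) MDS stands in for the statistical multidimensional scaling / Sammon-style mapping; all names, sizes, and the centroid-distance selection rule are invented for illustration.

```python
# Hypothetical sketch: map speaker-adapted acoustic models to 2D and select
# peripheral speakers. Distance measure, MDS variant, and selection rule are
# simplifications of the method described in the abstract.
import numpy as np

def pairwise_distances(supervectors):
    """Euclidean distances between per-speaker mean supervectors.

    `supervectors` is an (n_speakers, dim) array; each row is assumed to
    concatenate the Gaussian mean parameters of one speaker-adapted HMM
    (a simplification of a distance between Gaussian distributions).
    """
    sq = (supervectors ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * supervectors @ supervectors.T
    return np.sqrt(np.clip(d2, 0.0, None))

def classical_mds(D, dim=2):
    """Project a distance matrix into `dim` dimensions with classical MDS,
    used here as a stand-in for the paper's Sammon-style mapping."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:dim]   # keep largest eigenvalues
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

def select_peripheral_speakers(coords, n_select):
    """Rank speakers by distance from the centroid of the 2D map and keep
    the outermost `n_select` speakers for full-scale data collection."""
    radii = np.linalg.norm(coords - coords.mean(axis=0), axis=1)
    return np.argsort(radii)[::-1][:n_select]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    supervectors = rng.normal(size=(733, 40))  # 733 hypothetical speakers
    D = pairwise_distances(supervectors)
    coords = classical_mds(D, dim=2)
    selected = select_peripheral_speakers(coords, n_select=200)
    print("collect full recordings from speakers:", selected[:10], "...")
```

In this sketch the cost saving comes from recording only a few utterances from every candidate speaker and reserving full-scale recording for the peripheral speakers the map identifies.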
