Paper information - Design of ready-made acoustic model library by two-dimensional visualization of acoustic space


This paper proposes a technique for designing a ready-made library of high-performance, small-size acoustic models by visualizing multiple HMM acoustic models in a two-dimensional space (the “COSMOS” method: aCOustic Space Map Of Sound), and for providing one of these models to a user without overburdening them. The acoustic space (expressed in multi-dimensional feature parameters) is partitioned into zones on the two-dimensional space, and an acoustic model is generated for each zone, which allows highly precise acoustic models to be created. The set of these acoustic models is called an acoustic model library. In the experiment reported in this paper, a plotted map (called the COSMOS map) of a total of 145 male speakers speaking in various styles was generated with the COSMOS method. Using the COSMOS map, the distribution of each speaking style and the relationship between a speaker's position on the map and speech-recognition performance were analyzed, demonstrating the effectiveness of the COSMOS method for analyzing acoustic space. The COSMOS map was then partitioned into concentric acoustic space zones, and an acoustic model was produced for each zone. When the acoustic model with the maximum likelihood score was selected using voice samples of only 5 words, that model, even though expressed as a single Gaussian distribution, performed comparably to a speaker-independent acoustic model (called the SI-model) expressed as 16-mixture Gaussian distributions. Furthermore, it outperformed the SI-model adapted with voice samples of 30 words by the MLLR [7] method.
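As a rough illustration of the zone-and-select idea described in the abstract, the following Python sketch partitions points on a two-dimensional map into concentric rings, builds one model per ring, and picks the model that gives the highest log-likelihood for a few sample frames. Everything here is an assumption made for illustration: the function names, the toy data, the equal-count rings around the map centroid, and the use of a single diagonal-covariance Gaussian in place of an HMM acoustic model. It is not the authors' procedure.

```python
import math


def concentric_zones(points, n_zones):
    """Assign each 2-D map point to a ring around the centroid.

    Assumption: rings hold (roughly) equal numbers of points; ring 0 is innermost.
    """
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    radii = [math.hypot(x - cx, y - cy) for x, y in points]
    order = sorted(range(len(points)), key=lambda i: radii[i])
    zones = [0] * len(points)
    for rank, i in enumerate(order):
        zones[i] = rank * n_zones // len(points)
    return zones


def fit_gaussian(frames):
    """Fit a single diagonal-covariance Gaussian (mean, variance per dimension)."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Floor the variance so the log-likelihood stays finite.
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var


def log_likelihood(model, frames):
    """Total log-likelihood of the frames under a diagonal Gaussian."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total


def build_library(zones, speaker_frames):
    """Pool the frames of all speakers in a zone and fit one model per zone."""
    library = {}
    for zone in sorted(set(zones)):
        pooled = [f for z, frames in zip(zones, speaker_frames)
                  if z == zone for f in frames]
        library[zone] = fit_gaussian(pooled)
    return library


def select_model(library, frames):
    """Return the zone whose model gives the maximum likelihood score."""
    return max(library, key=lambda zone: log_likelihood(library[zone], frames))


# Toy data (hypothetical): six speakers placed on a 2-D map, two frames each.
map_points = [(0.0, 0.1), (0.1, -0.1), (-0.1, 0.0),
              (2.0, 0.0), (0.0, -2.0), (-2.0, 2.0)]
speaker_frames = [
    [[0.0, 0.2], [0.1, -0.1]],
    [[-0.2, 0.0], [0.1, 0.1]],
    [[0.0, -0.1], [0.2, 0.0]],
    [[3.0, 2.8], [2.9, 3.1]],
    [[3.2, 3.0], [2.8, 2.9]],
    [[3.1, 3.2], [3.0, 2.7]],
]

zones = concentric_zones(map_points, n_zones=2)
library = build_library(zones, speaker_frames)

# A short sample from a new speaker stands in for the 5-word voice sample.
new_speaker_sample = [[2.9, 3.0], [3.1, 2.9]]
best_zone = select_model(library, new_speaker_sample)
print("zones:", zones, "selected zone:", best_zone)
```

Selection only requires scoring a short sample against each model in the library, so no model parameters are re-estimated for the new speaker; this is the property the abstract contrasts with MLLR adaptation.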

Makoto Shozakai | Goshu Nagino

[1] Herman J. M. Steeneken, et al. Optimal selection of speech data for automatic speech recognition systems, 2002, INTERSPEECH.

[2] Chin-Hui Lee, et al. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, 1994, IEEE Trans. Speech Audio Process.

[3] John W. Sammon, et al. A Nonlinear Mapping for Data Structure Analysis, 1969, IEEE Transactions on Computers.

[4] Roland Kuhn, et al. Rapid speaker adaptation in eigenvoice space, 2000, IEEE Trans. Speech Audio Process.

[5] Tetsuo Kosaka, et al. Tree-structured speaker clustering for speaker-independent continuous speech recognition, 1994, ICSLP.

[6] Anil K. Jain, et al. Statistical Pattern Recognition: A Review, 2000, IEEE Trans. Pattern Anal. Mach. Intell.

[7] Philip C. Woodland, et al. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, 1995, Comput. Speech Lang.

[8] Mark J. F. Gales. Cluster adaptive training for speech recognition, 1998, ICSLP.