Incorporating dynamic features into minimum generation error training for HMM-based speech synthesis

This paper describes new methods of minimum generation error (MGE) training for HMM-based speech synthesis that introduce an error component for dynamic features into the generation error function. We propose two methods for setting the weight associated with this additional error component. In the fixed weighting approach, the weight is kept constant over the course of the speech. In the adaptive weighting approach, it is adjusted according to the degree of dynamics of each speech segment. Objective evaluation shows that the newly derived MGE criterion with the adaptive weighting method achieves comparable performance on static features and better performance on delta features compared to the baseline MGE criterion. Subjective evaluation shows an improvement in the quality of synthesized speech with the proposed technique. The newly derived criterion improves the ability of the HMMs to capture the dynamic properties of speech without increasing the computational complexity of the training process relative to the baseline criterion.
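The idea of a generation error with an added dynamic-feature term can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the squared Euclidean error, the centered-difference delta window, the function names `delta`, `generation_error` and `adaptive_weights`, and in particular the rule that scales the adaptive weight by the delta magnitude of the natural speech. The paper's actual error function and weighting rule may differ.

```python
import numpy as np

def delta(c):
    """First-order dynamic (delta) features of a (frames x dims) array.

    Assumed window: 0.5 * (c[t+1] - c[t-1]), with edge frames replicated.
    """
    padded = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    return 0.5 * (padded[2:] - padded[:-2])

def generation_error(c_nat, c_gen, weight):
    """Static error plus weighted delta error, summed over all frames.

    `weight` is a scalar (fixed weighting) or a per-frame array
    (adaptive weighting).
    """
    static_err = np.sum((c_gen - c_nat) ** 2, axis=1)
    delta_err = np.sum((delta(c_gen) - delta(c_nat)) ** 2, axis=1)
    w = np.broadcast_to(np.asarray(weight, dtype=float), static_err.shape)
    return float(np.sum(static_err + w * delta_err))

def adaptive_weights(c_nat, w_max=1.0):
    """Hypothetical adaptive rule: per-frame weight proportional to the
    delta magnitude of the natural speech, scaled into [0, w_max]."""
    mag = np.linalg.norm(delta(c_nat), axis=1)
    peak = mag.max()
    if peak == 0.0:
        return np.zeros_like(mag)
    return w_max * mag / peak

# Toy one-dimensional trajectories: natural ramp vs. flat generated output.
c_nat = np.array([[0.0], [1.0], [2.0], [3.0]])
c_gen = np.zeros_like(c_nat)

fixed_error = generation_error(c_nat, c_gen, weight=2.0)
adaptive_error = generation_error(c_nat, c_gen, adaptive_weights(c_nat))
```

In this sketch the adaptive variant penalizes delta mismatches most strongly in frames where the natural trajectory changes quickly, while the fixed variant applies the same penalty to every frame. Both reuse the same per-frame error terms, so the added term costs one extra pass over the frames.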
