Training a parametric-based logF0 model with the minimum generation error criterion

This paper describes an approach for improving a statistical parametric logF0 model using minimum generation error (MGE) training. Compared with the previous scheme based on decision tree clustering, MGE training minimises the error in the generated logF0 by taking into account not only each cluster in isolation, but also the way the clusters interact with each other when the F0 contour is generated over the whole sentence. Moreover, the "weights" of each component of the model, which were previously adjusted manually, are optimised automatically by the MGE training during the re-estimation of the model covariances. Objective evaluation indicates that, although the logF0 contours generated by the models trained with MGE have approximately the same root mean square error and correlation factor as those generated with the baseline models, they present a higher dynamic range. Subjective evaluation shows a small but significant preference for the system trained with MGE.
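The three objective measures named above can be illustrated with a minimal sketch. The abstract does not define how they were computed, so everything here is an assumption: the contours are taken to be equal-length lists of logF0 values on voiced frames, the correlation is the Pearson coefficient, and the dynamic range is the difference between the maximum and minimum logF0. The function names and the sample values are hypothetical.

```python
import math


def rmse(reference, generated):
    """Root mean square error between two equal-length logF0 contours."""
    if not reference or len(reference) != len(generated):
        raise ValueError("contours must be non-empty and of equal length")
    squared = sum((r - g) ** 2 for r, g in zip(reference, generated))
    return math.sqrt(squared / len(reference))


def correlation(reference, generated):
    """Pearson correlation between two equal-length logF0 contours."""
    if not reference or len(reference) != len(generated):
        raise ValueError("contours must be non-empty and of equal length")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_g = sum(generated) / n
    cov = sum((r - mean_r) * (g - mean_g) for r, g in zip(reference, generated))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_g = sum((g - mean_g) ** 2 for g in generated)
    return cov / math.sqrt(var_r * var_g)


def dynamic_range(contour):
    """Assumed definition: maximum minus minimum logF0 in the contour."""
    return max(contour) - min(contour)


# Hypothetical reference contour (F0 in Hz, converted to logF0).
reference = [math.log(f) for f in (120.0, 150.0, 180.0, 140.0, 110.0)]

# Hypothetical over-smoothed contour: the reference compressed towards its mean.
mean_ref = sum(reference) / len(reference)
compressed = [mean_ref + 0.5 * (x - mean_ref) for x in reference]

print("RMSE:", rmse(reference, compressed))
print("correlation:", correlation(reference, compressed))
print("dynamic range (reference):", dynamic_range(reference))
print("dynamic range (compressed):", dynamic_range(compressed))
```

In this toy case the compressed contour keeps a perfect correlation with the reference while its dynamic range is halved, which shows why a measure of range can separate contours that the correlation alone does not distinguish.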
