Use of Generation Process Model for Improved Control of Fundamental Frequency Contours in HMM-Based Speech Synthesis

The generation process model of fundamental frequency contours is well suited to representing the global features of prosody. It is a command-response model whose commands relate clearly to the linguistic and para/nonlinguistic information conveyed by an utterance. Handling fundamental frequency contours within the framework of this model makes flexible prosody control possible in speech synthesis. The model can also be used to solve problems in hidden Markov model (HMM)-based speech synthesis that arise from the frame-by-frame treatment of fundamental frequencies. Methods are developed to impose model-based constraints both before HMM training and after speech synthesis. For more flexible control, a method is developed that focuses on the differences in model command magnitudes between the original and target styles. Prosodic focus is realized in synthetic speech using a small number of parallel speech samples uttered by a speaker who is not among the speakers in the training corpus of the baseline HMM-based speech synthesis system. The method is also applied to voice and style conversion.
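
As a rough illustration of what "command-response model" means here, the sketch below implements the commonly cited formulation of the generation process (Fujisaki) model, in which log F0 is a baseline plus the summed responses to phrase commands (impulses) and accent commands (stepwise pulses). The abstract itself gives no equations, so the constants `ALPHA`, `BETA`, and `GAMMA` are typical textbook values assumed for illustration, and the function names are hypothetical. The helper `scale_accents` only hints at the idea of changing command magnitudes to shift style or focus; it is not the authors' method.

```python
import math

# Assumed typical constants; they are not taken from the abstract.
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling on the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism."""
    if t < 0.0:
        return 0.0
    return alpha * alpha * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def log_f0(t, fb, phrase_cmds, accent_cmds):
    """Log F0 at time t (seconds).

    fb          -- baseline frequency in Hz
    phrase_cmds -- list of (onset, magnitude)
    accent_cmds -- list of (onset, offset, amplitude)
    """
    value = math.log(fb)
    for t0, ap in phrase_cmds:
        value += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:
        value += aa * (accent_response(t - t1) - accent_response(t - t2))
    return value


def f0_contour(times, fb, phrase_cmds, accent_cmds):
    """F0 in Hz at each time in `times`."""
    return [math.exp(log_f0(t, fb, phrase_cmds, accent_cmds)) for t in times]


def scale_accents(accent_cmds, factor):
    """Hypothetical helper: scale accent amplitudes, leaving timing unchanged."""
    return [(t1, t2, aa * factor) for t1, t2, aa in accent_cmds]


if __name__ == "__main__":
    times = [i * 0.05 for i in range(21)]       # 0.0 s to 1.0 s
    phrases = [(0.0, 0.5)]                      # one phrase command
    accents = [(0.2, 0.5, 0.4)]                 # one accent command
    neutral = f0_contour(times, 100.0, phrases, accents)
    focused = f0_contour(times, 100.0, phrases, scale_accents(accents, 1.5))
    for t, a, b in zip(times, neutral, focused):
        print(f"{t:4.2f} s  neutral {a:6.1f} Hz  focused {b:6.1f} Hz")
```

Because a contour is fully described by a handful of command timings and magnitudes, a change such as placing focus on a word can be expressed as a change to a few numbers rather than to every frame, which is the kind of control the abstract contrasts with frame-by-frame F0 treatment.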
