A style control technique for HMM-based speech synthesis

This paper describes an approach to controlling style of synthetic speech in HMM-based speech synthesis. The style is defined as one of speaking styles and emotional expressions in speech. We model each speech synthesis unit by using a context-dependent HMM whose mean vector of the output distribution function is given by a function of a parameter vector called style control vector. We assume that the mean vector is modeled by multiple regression with the style control vector. The multiple regression matrices are estimated by EMalgorithm as well as other model parameters of HMMs. In the synthesis stage, the mean vectors are modified by transforming an arbitrarily given control vector which is associated with a desired style. The results of subjective tests show that we can control styles by choosing the style control vector appropriately.

[1]  K. Tokuda,et al.  A Training Method of Average Voice Model for HMM-Based Speech Synthesis , 2003, IEICE Trans. Fundam. Electron. Commun. Comput. Sci..

[2]  Shigeki Sagayama,et al.  Multiple-regression hidden Markov model , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[3]  Takao Kobayashi,et al.  Modeling of various speaking styles and emotions for HMM-based speech synthesis , 2003, INTERSPEECH.

[4]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[5]  J. Yamagishi,et al.  HMM-Based Speech Synthesis with Various Speaking Styles Using Model Interpolation , 2004 .

[6]  Marc Schröder,et al.  Emotional speech synthesis: a review , 2001, INTERSPEECH.

[7]  Takao Kobayashi,et al.  Speaking style adaptation using context clustering decision tree for HMM-based speech synthesis , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8]  Masatsune Tamura,et al.  A Context Clustering Technique for Average Voice Models , 2003 .