A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis

This paper describes a novel parameter generation algorithm for an HMM-based speech synthesis technique. The conventional algorithm generates a parameter trajectory of static features that maximizes the likelihood of a given HMM for the parameter sequence consisting of the static and dynamic features under an explicit constraint between those two features. The generated trajectory is often excessively smoothed due to the statistical processing. Using the over-smoothed speech parameters usually causes muffled sounds. In order to alleviate the over-smoothing effect, we propose a generation algorithm considering not only the HMM likelihood maximized in the conventional algorithm but also a likelihood for a global variance (GV) of the generated trajectory. The latter likelihood works as a penalty for the over-smoothing, i.e., a reduction of the GV of the generated trajectory. The result of a perceptual evaluation demonstrates that the proposed algorithm causes considerably large improvements in the naturalness of synthetic speech.

[1]  Satoshi Imai,et al.  Cepstral analysis synthesis on the mel frequency scale , 1983, ICASSP.

[2]  D H Klatt,et al.  Review of text-to-speech conversion for English. , 1987, The Journal of the Acoustical Society of America.

[3]  Y. Sagisaka,et al.  Speech synthesis by rule using an optimal selection of non-uniform synthesis units , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[4]  S. Nakajima,et al.  Automatic generation of synthesis units based on context oriented clustering , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[5]  Keiichi Tokuda,et al.  An adaptive algorithm for mel-cepstral analysis of speech , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[6]  N. Iwahashi,et al.  Speech Segment Selection for Concatenative Synthesis Based on Spectral Distortion Minimization , 1993 .

[7]  K. Tokuda,et al.  Speech parameter generation from HMM using dynamic features , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[8]  Keiichi Tokuda,et al.  An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features , 1995, EUROSPEECH.

[9]  Alan W. Black,et al.  Unit selection in a concatenative speech synthesis system using a large speech database , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[10]  Takehiko Kagoshima,et al.  An F0 contour control model for totally speaker driven text to speech system , 1998, ICSLP.

[11]  Takehiko Kagoshima,et al.  Analytic generation of synthesis units by closed loop training for totally speaker driven text to speech system (TOS drive TTS) , 1998, ICSLP.

[12]  Keiichi Tokuda,et al.  Hidden Markov models based on multi-space probability distribution for pitch pattern modeling , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[13]  Keiichi Tokuda,et al.  Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis , 1999, EUROSPEECH.

[14]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[15]  Hideyuki Mizuno,et al.  A new F0 contour control method based on vector representation of F0 contour , 1999, EUROSPEECH.

[16]  Keiichi Tokuda,et al.  Speech parameter generation algorithms for HMM-based speech synthesis , 2000, 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.00CH37100).

[17]  Matthew J. Makashay,et al.  Corpus-based techniques in the AT&t nextgen synthesis system , 2000, INTERSPEECH.

[18]  Koichi Shinoda,et al.  MDL-based context-dependent subword modeling for speech recognition , 2000 .

[19]  Hu Peng,et al.  Selecting non-uniform units from a very large corpus for concatenative speech synthesizer , 2001, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221).

[20]  Takashi Saito,et al.  Generating F0 Contours by Statistical Manipulation of Natural F0 Shapes , 2001, IEICE Trans. Inf. Syst..

[21]  Yannis Stylianou,et al.  Applying the harmonic plus noise model in concatenative speech synthesis , 2001, IEEE Trans. Speech Audio Process..

[22]  Keiichi Tokuda,et al.  Multi-Space Probability Distribution HMM , 2002 .

[23]  H. Zen,et al.  An HMM-based speech synthesis system applied to English , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002..

[24]  H. Kawai,et al.  Study on time-dependent voice quality variation in a large-scale single speaker speech corpus used for speech synthesis , 2002, Proceedings of 2002 IEEE Workshop on Speech Synthesis, 2002..

[25]  吉村 貴克,et al.  Simultaneous modeling of phonetic and prosodic parameters,and characteristic conversion for HMM-based text-to-speech systems , 2002 .

[26]  Antoine Raux,et al.  A unit selection approach to F0 modeling and its application to emphasis , 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721).

[27]  Keiichi Tokuda,et al.  XIMERA: a new TTS from ATR based on corpus-based technologies , 2004, SSW.

[28]  Tatsuya Mizutani,et al.  Concatenative Speech Synthesis Based on the Plural Unit Selection and Fusion Method , 2005, IEICE Trans. Inf. Syst..

[29]  Heiga Zen,et al.  An overview of nitech HMM-based speech synthesis system for blizzard challenge 2005 , 2005, INTERSPEECH.

[30]  Keiichi Tokuda,et al.  Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[31]  Tomoki Toda,et al.  An evaluation of cost functions sensitively capturing local degradation of naturalness for segment selection in concatenative speech synthesis , 2006, Speech Commun..