Comparing QMT1 and HMMs for the synthesis of American English prosody

Three models are compared for predicting the duration and pitch contour of American English in a speech synthesis framework. The models combine duration prediction by Quantification Method Type 1 (QMT1), a Codebook-based method for the F0 contour, and a Hidden Markov Model (HMM)-based method for both durations and F0. Subjective listening tests show that the HMMs are preferred over the Codebook for the F0 contour, but that their duration modelling performance is not significantly different from that of QMT1 in the tested setup. An analysis of freeform comments from naive listeners supports these findings and suggests that such comments can give useful hints about the performance of each system.
