Multilevel parametric-base F0 model for speech synthesis

Abstract This paper proposes a new F0 model for speech synthesis based on the parameterization of the logF0 contour of the syllables. This parameterization consists of the N-order discrete cosine transform (DCT) plus some additional parameters such as the gradient of the syllable average pitch. A statistical model of the syllable pitch contour is then created by clustering the parameterized vectors with a decision tree. Similar statistical models are also created for linguistic levels other than the syllable. For synthesis, the statistical model of each level is used to define a log-likelihood function for the input text. These functions are then weighted and added into a global log-likelihood function, which is then maximized with respect to the DCT coefficients of the syllable model. The final logF0 contour is obtained from the inverse transformation of the syllable DCT coefficients. A subjective test showed a clear preference for the proposed model over our previous HMM-based baseline.

Index Terms: speech synthesis, HMM-based synthesis, prosody, discrete cosine transform
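The syllable-level parameterization and the final inverse transformation can be illustrated with a minimal sketch. It assumes an orthonormal DCT-II and a made-up eight-frame contour; the abstract does not specify the normalization, the order N, or the frame rate, and the function names `dct_params` and `inverse_dct` are hypothetical. The additional parameters (such as the gradient of the syllable average pitch) and the statistical models are not covered.

```python
import math

def dct_params(logf0, order):
    """Return the first `order` orthonormal DCT-II coefficients of a logF0 contour.

    Assumption: orthonormal scaling; the paper may use a different convention.
    `order` must not exceed len(logf0).
    """
    t = len(logf0)
    coeffs = []
    for k in range(order):
        scale = math.sqrt(1.0 / t) if k == 0 else math.sqrt(2.0 / t)
        total = sum(x * math.cos(math.pi * (n + 0.5) * k / t)
                    for n, x in enumerate(logf0))
        coeffs.append(scale * total)
    return coeffs

def inverse_dct(coeffs, length):
    """Rebuild a contour of `length` frames from truncated DCT coefficients.

    `length` should equal the length used in dct_params, because the
    orthonormal scaling depends on it.
    """
    contour = []
    for n in range(length):
        value = 0.0
        for k, c in enumerate(coeffs):
            scale = math.sqrt(1.0 / length) if k == 0 else math.sqrt(2.0 / length)
            value += scale * c * math.cos(math.pi * (n + 0.5) * k / length)
        contour.append(value)
    return contour

# Hypothetical F0 values (Hz) for one syllable, converted to logF0.
logf0 = [math.log(f) for f in [120, 125, 131, 134, 132, 127, 121, 116]]

coeffs = dct_params(logf0, 4)             # low-order description of the contour
recon = inverse_dct(coeffs, len(logf0))   # smoothed contour from 4 coefficients
```

With this scaling, the first coefficient is proportional to the syllable's mean logF0, and the higher coefficients describe progressively finer contour shape, so truncating to a small N keeps the overall movement while discarding frame-level detail.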
