Improved Prosody Generation by Maximizing Joint Probability of State and Longer Units

Current state-of-the-art hidden Markov model (HMM)-based text-to-speech (TTS) systems can produce highly intelligible synthesized speech with decent segmental quality. However, the prosody, especially at the phrase or sentence level, still tends to be bland. This blandness is partially due to the fact that the state-based HMM is inadequate for capturing global, hierarchical suprasegmental information in speech signals. In this paper, to improve TTS prosody, longer units are first explicitly modeled with appropriate parametric distributions. The resulting models are then integrated with the state-based baseline models to generate better prosody by maximizing the joint probability of the state and longer units. Experimental results in both Mandarin and English show consistent improvements over our baseline system, which uses only a state-based prosody model. The improvements are both objectively measurable and subjectively perceivable.
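To make the idea of joint maximization concrete, the following is a minimal sketch for duration only, not the method of the paper itself. It assumes, purely for illustration, that each state duration follows a Gaussian, that the total duration of one longer unit (for example a phone) also follows a Gaussian, and that the two log-likelihoods are combined with a scalar weight. The function name `refine_state_durations`, its parameters, and the Gaussian choice are all hypothetical; the abstract only says "appropriate parametric distributions". Under these assumptions the maximizer has a closed form.

```python
def refine_state_durations(state_means, state_vars, unit_mean, unit_var, weight=1.0):
    """Maximize  sum_i log N(d_i; m_i, v_i) + weight * log N(sum_i d_i; M, V)  over d.

    Illustrative assumptions (not taken from the paper): Gaussian state
    durations, one Gaussian longer-unit duration model, and a scalar weight.
    Setting the gradient to zero gives
        d_i = m_i - v_i * weight * (T - M) / V,   with T = sum_i d_i,
    and summing over i yields a closed form for T.
    """
    sum_mean = sum(state_means)
    sum_var = sum(state_vars)
    # Optimal total duration: a precision-weighted compromise between the
    # sum of state means and the longer-unit mean.
    total = (sum_mean * unit_var + weight * sum_var * unit_mean) / (
        unit_var + weight * sum_var
    )
    # Common shift factor; states with larger variance absorb more of the change.
    shift = weight * (total - unit_mean) / unit_var
    return [m - v * shift for m, v in zip(state_means, state_vars)]


if __name__ == "__main__":
    # Three states whose means sum to 12 frames, while the (hypothetical)
    # longer-unit model prefers 15 frames.
    durations = refine_state_durations(
        state_means=[3.0, 5.0, 4.0],
        state_vars=[1.0, 4.0, 1.0],
        unit_mean=15.0,
        unit_var=6.0,
    )
    print(durations, sum(durations))
```

With `weight=0` the result falls back to the state means, which corresponds to a state-only baseline; increasing the weight pulls the total duration toward the longer-unit mean, with high-variance states stretched or compressed the most.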
