Modeling DCT parameterized F0 trajectory at intonation phrase level with DNN or decision tree

In conventional HMM-based TTS, the microstructure of the F0 contour is modeled at the state level via a clustered decision tree. However, decision-tree-based state-level modeling has difficulty capturing the long-term structure of speech prosody, for example at the intonation phrase level. This is due to the greedy nature of the tree search and to training data that are usually too sparse to cover the large, combinatorial number of long prosodic contexts in a phrase or sentence. In this study, we adopt a finite number of Discrete Cosine Transform (DCT) coefficients to capture the smoothed trend of the F0 pattern of each intonation phrase and to normalize the effect of variable phrase duration. We then model phrase intonation from the DCT-smoothed contours with either a decision tree or a deep neural network (DNN). The remaining detail, the residual F0, is modeled at the state level within a Hierarchical Prosody Model (HPM) framework. The phrase-level models are then used to predict the intonation phrase F0 contours, which are combined with the predicted state-level F0 residuals to produce the final F0 contours. Both the decision-tree-based and the DNN-based F0 predictors, when working together with the state-level F0 residual predictor, outperform the standard state-level HMM F0 models.
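The phrase-level parameterization described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `encode_phrase` and `decode_phrase`, the choice of 8 coefficients, the use of log F0, and the assumption that F0 has already been interpolated across unvoiced frames are all illustrative assumptions. The sketch truncates a DCT-II of one phrase's contour to a fixed number of coefficients, scaled by the phrase length so that phrases of different durations map to comparable vectors, and treats what the smoothed contour leaves out as the residual for a state-level model.

```python
import numpy as np

def _cos_basis(n_frames, n_coef):
    # DCT-II cosine basis, shape (n_coef, n_frames).
    n = np.arange(n_frames)
    k = np.arange(n_coef)[:, None]
    return np.cos(np.pi * (n + 0.5) * k / n_frames)

def encode_phrase(log_f0, n_coef=8):
    """Map a variable-length phrase contour to n_coef DCT coefficients.

    Dividing by the number of frames makes coefficient 0 the contour mean,
    so the vector does not scale with phrase duration.
    Assumes log_f0 is continuous (unvoiced frames already interpolated).
    """
    log_f0 = np.asarray(log_f0, dtype=float)
    return _cos_basis(len(log_f0), n_coef) @ log_f0 / len(log_f0)

def decode_phrase(coefs, n_frames):
    """Rebuild a smoothed contour of any length from the coefficients."""
    coefs = np.asarray(coefs, dtype=float)
    weights = np.full(len(coefs), 2.0)
    weights[0] = 1.0  # the mean term is not doubled in the inverse transform
    return (weights * coefs) @ _cos_basis(n_frames, len(coefs))

# Hypothetical phrase: a declining contour with some faster local movement.
frames = 50
log_f0 = np.log(np.linspace(220.0, 180.0, frames)) \
    + 0.05 * np.sin(np.linspace(0.0, 20.0, frames))

coefs = encode_phrase(log_f0, n_coef=8)   # phrase-level modeling target
smooth = decode_phrase(coefs, frames)     # smoothed phrase intonation
residual = log_f0 - smooth                # left for the state-level model
final = smooth + residual                 # combination at synthesis time
```

In this sketch, a phrase-level predictor (decision tree or DNN) would be trained to output `coefs` from phrase-level context features, and `decode_phrase` would stretch the predicted coefficients to whatever phrase duration is needed before the predicted residual is added.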
