Training a supra-segmental parametric F0 model without interpolating F0

Combining multiple intonation models at different linguistic levels is an effective way to improve the naturalness of the predicted F0. In many of these approaches, the intonation models for suprasegmental levels are based on a parametrization of the log-F0 contours over the units of that level. However, many of these parametrizations are not stable when applied to discontinuous signals, so the F0 signal has to be interpolated across its unvoiced gaps. The interpolated values distort the coefficients, which degrades the quality of the model. This paper proposes two methods that eliminate the need for such interpolation, one based on regularization and the other on factor analysis. Subjective evaluations show that, for a syllable-level discrete cosine transform (DCT) model, both approaches yield a significant improvement over a baseline that uses interpolated F0. The regularization-based approach yields the best results.
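The regularization idea can be illustrated with a minimal sketch. It is not the paper's formulation: the abstract does not specify the regularizer, so the ridge penalty on the non-mean coefficients, the function names `dct_basis` and `fit_dct_voiced_only`, and the defaults `n_coeffs=5` and `lam=0.1` are all assumptions made for illustration. The sketch fits syllable-level DCT coefficients to the voiced frames only, so no interpolated F0 values enter the estimate.

```python
import numpy as np


def dct_basis(n_frames, n_coeffs):
    """DCT-II cosine basis sampled at every frame: shape (n_frames, n_coeffs)."""
    t = (np.arange(n_frames) + 0.5) / n_frames
    k = np.arange(n_coeffs)
    return np.cos(np.pi * np.outer(t, k))


def fit_dct_voiced_only(logf0, voiced, n_coeffs=5, lam=0.1):
    """Ridge-regularized least-squares DCT fit that uses voiced frames only.

    logf0  : log-F0 value per frame (values at unvoiced frames are ignored)
    voiced : boolean mask per frame, True where F0 was observed
    lam    : penalty weight (hypothetical default, assumed > 0)
    Returns the coefficients and the contour rebuilt over all frames.
    """
    logf0 = np.asarray(logf0, dtype=float)
    voiced = np.asarray(voiced, dtype=bool)
    if not voiced.any():
        raise ValueError("at least one voiced frame is required")

    basis = dct_basis(len(logf0), n_coeffs)
    basis_v = basis[voiced]   # keep only rows for observed frames
    target_v = logf0[voiced]

    # Assumed penalty: shrink every coefficient except the mean term (k = 0).
    penalty = np.diag(np.r_[0.0, np.ones(n_coeffs - 1)])

    # Normal equations of the penalized least-squares problem.
    lhs = basis_v.T @ basis_v + lam * penalty
    rhs = basis_v.T @ target_v
    coeffs = np.linalg.solve(lhs, rhs)
    return coeffs, basis @ coeffs
```

The penalty keeps the linear system well conditioned even when a syllable has fewer voiced frames than coefficients, which is the instability that would otherwise motivate interpolation. The returned contour is defined at every frame, including the unvoiced ones.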
