Improved generation of prosodic features in HMM-based Mandarin speech synthesis

HMM-based text-to-speech (TTS) systems can produce high-quality synthetic speech with flexible modeling of spectral and prosodic parameters. However, the prosodic features they generate, such as F0 and duration trajectories, are often excessively smoothed and lack prosodic variance. In HMM-based TTS, durations are typically modeled statistically with state duration probability distributions, and durations for unseen contexts are predicted without high-level linguistic knowledge. The F0 trajectory is generated by multi-space probability distribution HMMs (MSD-HMMs) as a weighted bias term. In this approach, discrete distributions model the voiced/unvoiced (VU) decision and continuous Gaussian distributions model F0 within voiced regions. Because F0 is assumed to be undefined in unvoiced regions, and because of the special structure of the MSD-HMM, the generated F0 values are limited in accuracy. In this paper, to improve prosodic feature generation over the standard HMM framework, an F0 generation process model is used to re-estimate F0 values in regions of pitch tracking errors as well as in unvoiced regions. Prior knowledge of the voicing of each Mandarin phoneme is imposed and used for the VU decision. We also design a set of syntax features to improve Mandarin phoneme duration prediction.
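As a rough illustration of the re-estimation idea, the sketch below assumes that the "F0 generation process model" is the widely used Fujisaki command-response model, in which ln F0(t) is a baseline value plus phrase and accent (tone) components. The abstract does not give the formulation or parameter values, so the function names, the constants alpha, beta and gamma, and the convention of marking unvoiced or unreliable frames with 0.0 are all assumptions made for this example, not the authors' implementation.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent (tone) control mechanism, Ga(t)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(times, fb, phrases, accents, alpha=3.0, beta=20.0, gamma=0.9):
    """Return F0 (Hz) at each time in `times` (seconds).

    fb      : baseline frequency in Hz
    phrases : list of (onset_time, magnitude) phrase commands
    accents : list of (onset_time, offset_time, amplitude) accent commands
    """
    contour = []
    for t in times:
        log_f0 = math.log(fb)
        for t0, ap in phrases:
            log_f0 += ap * phrase_response(t - t0, alpha)
        for t1, t2, aa in accents:
            log_f0 += aa * (accent_response(t - t1, beta, gamma)
                            - accent_response(t - t2, beta, gamma))
        contour.append(math.exp(log_f0))
    return contour

def fill_unreliable(observed, model):
    """Keep observed F0 where it is defined; use the model value where
    the frame is marked 0.0 (unvoiced or flagged as a tracking error)."""
    return [m if o == 0.0 else o for o, m in zip(observed, model)]

if __name__ == "__main__":
    times = [i * 0.01 for i in range(100)]          # 1 s at 10 ms frames
    model = fujisaki_f0(times, 100.0, [(0.0, 0.5)], [(0.2, 0.5, 0.4)])
    observed = [0.0 if 30 <= i < 40 else f for i, f in enumerate(model)]
    filled = fill_unreliable(observed, model)
    print(min(filled) > 0.0)
```

In a real system the command timings and magnitudes would have to be estimated from the reliable voiced frames before the model contour could be used to fill the remaining frames; that estimation step is outside the scope of this sketch.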
