Generation of fundamental frequency contours of Mandarin in HMM-based speech synthesis using generation process model

HMM-based speech synthesis can produce high-quality synthetic speech with flexible modeling of spectral and prosodic parameters. In this approach, short-term spectra, fundamental frequency (F0), and duration are generated separately by multi-stream HMMs. However, the quality of the synthetic speech degrades when the feature vectors used in training are noisy. Among the sources of noise, pitch tracking errors and the resulting flawed voiced/unvoiced (VU) decisions are the two key contributors to voice quality problems. Pitch tracking errors occur more often in Mandarin vowels of Tone 3 and Tone 4, because the pitch of these vowels can be very low and the signal is sometimes treated as aperiodic. In addition, F0 values in unvoiced regions, such as consonants, are normally defined as unavailable, which makes it impossible to model F0 with standard HMMs. The currently preferred solution is the multi-space distribution HMM (MSDHMM), in which discrete distributions model the VU decision and continuous Gaussian distributions model F0 within the voiced regions. Because of this assumption of undefined F0 values in unvoiced regions and the special structure of the MSDHMM, the generated F0 values are limited in accuracy. In this paper, an F0 generation process model is used to estimate F0 values in regions affected by pitch tracking errors, as well as in unvoiced regions. Prior knowledge of voicing is assigned to each Mandarin phoneme and then used for the VU decision. F0 can thus be modeled within the standard HMM framework.
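The idea of a continuous F0 contour combined with a phoneme-level voicing prior can be illustrated with a minimal sketch. It assumes that the generation process model is the Fujisaki command-response model, which the abstract does not spell out. The function names, the constants (alpha, beta, gamma), the command values, the phoneme labels, and the segment boundaries are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism, Gp(t)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0

def tone_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent/tone control mechanism, Ga(t), capped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def log_f0(t, base_hz, phrase_cmds, tone_cmds):
    """ln F0(t) = ln Fb + sum Ap*Gp(t - T0) + sum Aa*(Ga(t - T1) - Ga(t - T2))."""
    value = math.log(base_hz)
    for onset, magnitude in phrase_cmds:
        value += magnitude * phrase_response(t - onset)
    for start, end, amplitude in tone_cmds:
        value += amplitude * (tone_response(t - start) - tone_response(t - end))
    return value

def continuous_f0(n_frames, frame_shift, base_hz, phrase_cmds, tone_cmds):
    """F0 in Hz for every frame, including frames that will later be unvoiced."""
    return [math.exp(log_f0(i * frame_shift, base_hz, phrase_cmds, tone_cmds))
            for i in range(n_frames)]

def apply_vu_prior(f0, frame_shift, segments, voiced_phonemes):
    """Set F0 to 0.0 in frames whose phoneme is unvoiced according to the prior."""
    out = list(f0)
    for i in range(len(out)):
        t = i * frame_shift
        for start, end, phoneme in segments:
            if start <= t < end and phoneme not in voiced_phonemes:
                out[i] = 0.0
    return out

# Hypothetical commands: one phrase command, one positive and one negative tone command.
FRAME_SHIFT = 0.005  # seconds per frame (assumed)
phrase_cmds = [(0.0, 0.5)]                           # (onset time, magnitude)
tone_cmds = [(0.1, 0.3, 0.4), (0.4, 0.6, -0.3)]      # (start, end, amplitude)
contour = continuous_f0(160, FRAME_SHIFT, 100.0, phrase_cmds, tone_cmds)

# Hypothetical phoneme alignment and voicing prior.
segments = [(0.0, 0.1, "sh"), (0.1, 0.3, "i"), (0.3, 0.4, "t"),
            (0.4, 0.6, "a"), (0.6, 0.8, "sil")]
voiced = apply_vu_prior(contour, FRAME_SHIFT, segments, {"i", "a"})
```

Here `contour` is defined in every frame, so it could serve as a training target for an ordinary continuous-density HMM stream, while `voiced` shows how a per-phoneme voicing label, rather than a per-frame pitch tracker decision, would determine where F0 is actually used at synthesis time.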
