Continuous F0 in the source-excitation generation for HMM-based TTS: Do we need voiced/unvoiced classification?

Most HMM-based TTS systems use a hard voiced/unvoiced classification to produce a discontinuous F0 signal, which is then used to generate the source excitation. When a mixed source excitation is used, this decision can be based on two different sources of information: the state-specific MSD prior of the F0 models, and/or the frame-specific features generated by the aperiodicity model. This paper examines the meaning of these variables in the synthesis process, their interaction, and how they affect the perceived quality of the generated speech. The results of several perceptual experiments show that, when using mixed excitation, subjects consistently prefer samples with very few or no false unvoiced errors (voiced frames classified as unvoiced), whereas a reduction in the rate of false voiced errors does not produce any perceptual improvement. This suggests that, rather than using any form of hard voiced/unvoiced classification such as the MSD prior, it is better for synthesis to use a continuous F0 signal and rely on the frame-level soft voiced/unvoiced decision of the aperiodicity model.
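The difference between a hard and a soft voiced/unvoiced decision can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: it assumes a single-band aperiodicity value per frame (practical mixed-excitation vocoders typically use several frequency bands), and the function name `mixed_excitation`, its parameters, and the default sample rate and frame shift are hypothetical choices made for illustration only.

```python
import math
import random


def mixed_excitation(f0, aperiodicity, voiced=None,
                     sample_rate=16000, frame_shift=0.005, seed=0):
    """Build a single-band mixed excitation from frame-level tracks.

    f0           -- continuous F0 in Hz, one positive value per frame
    aperiodicity -- assumed noise share of the excitation power, in [0, 1]
    voiced       -- optional hard V/UV flags; None means no hard decision
    """
    rng = random.Random(seed)
    samples_per_frame = int(round(sample_rate * frame_shift))
    phase = 0.0
    out = []
    for i, (hz, ap) in enumerate(zip(f0, aperiodicity)):
        if voiced is not None and not voiced[i]:
            # Hard unvoiced decision: the periodic part is dropped entirely,
            # whatever the aperiodicity model says for this frame.
            ap = 1.0
        ap = min(max(ap, 0.0), 1.0)
        # Scale pulses so the pulse train has unit average power.
        pulse_gain = math.sqrt(sample_rate / hz)
        for _ in range(samples_per_frame):
            phase += hz / sample_rate
            pulse = 0.0
            if phase >= 1.0:  # one full pitch period has elapsed
                phase -= 1.0
                pulse = pulse_gain
            noise = rng.gauss(0.0, 1.0)
            # Power-preserving mix of the periodic and noise components.
            out.append(math.sqrt(1.0 - ap) * pulse + math.sqrt(ap) * noise)
    return out


if __name__ == "__main__":
    f0 = [125.0] * 4
    ap = [0.1] * 4
    soft = mixed_excitation(f0, ap)                        # continuous F0, soft decision
    hard = mixed_excitation(f0, ap, voiced=[True, False, True, True])
    print(len(soft), len(hard))
```

In this sketch, a false unvoiced error in the `voiced` flags replaces a whole frame with noise, whereas with `voiced=None` the periodic component is only attenuated according to the aperiodicity value, which is the behaviour the abstract argues for.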
