Discontinuous Observation HMM for Prosodic-Event-Based F0 Generation

This paper examines F0 modeling and generation techniques for spontaneous speech synthesis. In the previous study, we proposed a prosodic-unit HMM where the synthesis unit is defined as a segment between two prosodic events represented by a ToBI label framework. To take the advantage of the prosodicunit HMM, continuous F0 sequences must be modeled from discontinuous F0 data including unvoiced regions. The conventional F0 models such as the MSD-HMM and the continuous F0 HMM are not always appropriate for such demand. To overcome this problem, we propose an alternative F0 model named discontinuous observation HMM (DO-HMM) where the unvoiced frames are regarded as missing data. We objectively evaluate the performance of the DO-HMM by comparing it with the conventional F0 modeling techniques and discuss the results.

[1]  Takashi Nose,et al.  On the Use of Extended Context for HMM-Based Spontaneous Conversational Speech Synthesis , 2011, INTERSPEECH.

[2]  Heiga Zen,et al.  Hidden semi-Markov model based speech synthesis , 2004, INTERSPEECH.

[3]  K. Maekawa CORPUS OF SPONTANEOUS JAPANESE : ITS DESIGN AND EVALUATION , 2003 .

[4]  Keiichi Tokuda,et al.  Multi-Space Probability Distribution HMM , 2002 .

[5]  Kai Yu,et al.  Continuous F0 Modeling for HMM Based Statistical Parametric Speech Synthesis , 2011, IEEE Transactions on Audio, Speech, and Language Processing.

[6]  Hideki Kawahara,et al.  Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds , 1999, Speech Commun..

[7]  Mari Ostendorf,et al.  A dynamical system model for generating fundamental frequency for speech synthesis , 1999, IEEE Trans. Speech Audio Process..

[8]  Takashi Nose,et al.  An F0 modeling technique based on prosodic events for spontaneous speech synthesis , 2012, 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Hideaki Kikuchi,et al.  X-JToBI: an extended j-toBI for spontaneous speech , 2002, INTERSPEECH.

[10]  Yonghong Yan,et al.  Improved modeling for F0 generation and V/U decision in HMM-based TTS , 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[11]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[12]  Arturo Camacho Lozano,et al.  SWIPE: A Sawtooth Waveform Inspired Pitch Estimator for Speech and Music , 2011 .