HMM-Based Emphatic Speech Synthesis Using Unsupervised Context Labeling

This paper describes an approach to HMM-based expressive speech synthesis that does not require supervised labeling of emphasis context. We use appealing-style speech whose sentences were taken from real domains. To reduce the cost of labeling speech data with emphasis context for model training, we propose an unsupervised technique that labels emphasis context based on the difference between the original and generated F0 patterns of the training sentences. Although the labeling criterion is quite simple, subjective evaluation results show that the unsupervised labeling is comparable to careful manual labeling in terms of speech naturalness and emphasis reproducibility.

Index Terms: HMM-based speech synthesis, expressive speech, emphasis expression, unsupervised labeling, F0 generation
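The labeling idea can be illustrated with a minimal sketch. The abstract states only that labels are derived from the difference between original and generated F0 patterns, so everything specific below is an assumption made for illustration: the segment unit, the use of the mean log-F0 difference over voiced frames, the threshold value, and the function name `label_emphasis`. It should not be read as the paper's exact criterion.

```python
import numpy as np


def label_emphasis(original_f0, generated_f0, segments, threshold=0.1):
    """Return one boolean per segment: True if the segment is labeled as emphasized.

    original_f0, generated_f0: frame-level F0 in Hz, with 0 marking unvoiced frames.
    segments: list of (start_frame, end_frame) pairs, end exclusive (assumed unit).
    threshold: hypothetical margin in log-F0 units, not taken from the paper.
    """
    original_f0 = np.asarray(original_f0, dtype=float)
    generated_f0 = np.asarray(generated_f0, dtype=float)
    labels = []
    for start, end in segments:
        orig = original_f0[start:end]
        gen = generated_f0[start:end]
        # Compare only frames that are voiced in both contours.
        voiced = (orig > 0) & (gen > 0)
        if not voiced.any():
            labels.append(False)
            continue
        # Mean log-F0 difference: positive when the original is higher
        # than what a model without emphasis context generated.
        diff = np.mean(np.log(orig[voiced]) - np.log(gen[voiced]))
        labels.append(bool(diff > threshold))
    return labels


# Toy contours: the second segment is clearly higher than the generated F0.
original = [200, 210, 0, 260, 270, 265]
generated = [200, 205, 0, 210, 215, 0]
print(label_emphasis(original, generated, [(0, 3), (3, 6)]))  # [False, True]
```

In this sketch the generated contour would come from a model trained without emphasis context, so a segment whose original F0 lies well above the generated F0 is treated as a candidate for the emphasis label.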
