Synthesis of F0 contours using generation process model parameters predicted from unlabeled corpora: application to emotional speech synthesis

Abstract A corpus-based method of generating fundamental frequency (F0) contours from text was developed for Japanese. Instead of predicting F0 values directly, the method uses binary decision trees to predict the command values of the F0 contour generation process model. Because the model controls F0 movement over word-length or longer units, sudden undulations, which are unlikely in natural utterances, are avoided even when a prediction is erroneous. The method includes a scheme for extracting the model commands from given F0 contours, which makes it possible to prepare the corpora for training the binary decision trees automatically. Since the accuracy of the extracted model commands in the training corpora is crucial for the method, constraints are imposed on the locations of the commands. Although the method can generate any speaking style for which a corpus is available, this paper aims at realizing three types of emotional speech (anger, joy, and sadness) in addition to calm speech. The mismatches between the predicted and target contours for angry speech were similar to those for calm speech. Emotional speech was then synthesized: phoneme durations were predicted by a similar corpus-based method, and segmental features were generated using an HMM-based speech synthesizer. A perceptual experiment on the synthesized speech indicated that anger was conveyed well by the developed method. The results were less satisfactory for joy and sadness.
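To illustrate how a small set of commands yields a smooth contour, the following is a minimal sketch of the generation process model, assuming its commonly used formulation (often called the Fujisaki model): log F0 is the sum of a baseline, phrase components (impulse responses), and accent components (step responses). The constants `ALPHA`, `BETA`, and `GAMMA`, the function names, and the command values in the usage example are illustrative assumptions, not values taken from this paper.

```python
import math

# Assumed typical constants; the paper's actual settings are not given here.
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling level of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0.0 else 0.0


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, clipped at gamma."""
    if t < 0.0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, base_f0, phrase_cmds, accent_cmds):
    """Return F0 in Hz at each time point.

    phrase_cmds: list of (onset_time, magnitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    contour = []
    for t in times:
        log_f0 = math.log(base_f0)
        for t0, ap in phrase_cmds:
            log_f0 += ap * phrase_response(t - t0)
        for t1, t2, aa in accent_cmds:
            log_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
        contour.append(math.exp(log_f0))
    return contour


if __name__ == "__main__":
    # Hypothetical utterance: one phrase command and one accent command.
    times = [i * 0.01 for i in range(200)]
    f0 = f0_contour(times, 120.0, [(0.0, 0.5)], [(0.3, 0.8, 0.4)])
    print(f"peak F0: {max(f0):.1f} Hz")
```

Because every command passes through a smooth response function, a mispredicted magnitude or timing shifts the contour gradually instead of producing a frame-level jump, which is the robustness property the abstract describes.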
