Prosody analysis and modeling for emotional speech synthesis

Current concatenative text-to-speech systems can synthesize a variety of emotions, but the subtlety and range of the results are limited because large amounts of emotional speech data are required. This paper studies a more flexible approach based on analyzing and modeling emotional prosody features. Perceptual tests are first performed to investigate whether manipulating prosody features alone can convey the intended emotions. Based on the positive results, a single corpus with sufficient prosody coverage is then shared across the different emotions in unit selection. Finally, an adaptation algorithm is proposed to predict emotional prosody features. It models the prosodic variations due to linguistic cues and to emotional cues separately, and requires only a small amount of data. Experiments on Mandarin show that the adaptation algorithm yields appropriate emotional prosody features, and that several emotions can be synthesized without a dedicated emotional corpus.
