Paper information - Two-stage prosody prediction for emotional text-to-speech synthesis


In this paper, we adopt a difference approach to prosody prediction for emotional text-to-speech synthesis: the prosodic variations between emotional and neutral speech are decomposed into global and local components, which are predicted by a two-stage model. The global prosodic variations are modeled by the means and standard deviations of the prosodic parameters, while the local prosodic variations are modeled by a classification and regression tree (CART) combined with dynamic programming. The proposed two-stage model has been implemented as the prosodic module of an emotional text-to-speech synthesis system built on the Festival-MBROLA architecture, which is able to synthesize highly intelligible, natural and expressive speech.
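As a rough illustration of the global stage, the sketch below rescales a neutral contour so that its mean and standard deviation match target emotional statistics. The abstract does not give the actual mapping, so the linear rescaling used here is an assumption, and the helper name `global_prosody_shift`, the sample F0 values, and the target statistics are all hypothetical.

```python
from statistics import mean, pstdev


def global_prosody_shift(neutral_values, target_mean, target_std):
    """Rescale a neutral contour to target global statistics.

    Hypothetical helper: a simple linear mean/std mapping, assumed here
    for illustration and not taken from the paper.
    """
    src_mean = mean(neutral_values)
    src_std = pstdev(neutral_values)  # population standard deviation
    if src_std == 0:
        # A flat contour has no variation to rescale, so only the mean moves.
        return [float(target_mean) for _ in neutral_values]
    scale = target_std / src_std
    return [target_mean + (v - src_mean) * scale for v in neutral_values]


# Made-up F0 values in Hz for a neutral utterance.
neutral_f0 = [110.0, 120.0, 130.0, 120.0]

# Made-up target statistics standing in for an emotional speaking style.
shifted = global_prosody_shift(neutral_f0, target_mean=180.0, target_std=20.0)
```

The rescaled contour keeps the shape of the neutral one while adopting the target mean and spread. In the two-stage scheme described above, the local stage (CART with dynamic programming) would then account for the variations that such a global mapping cannot capture.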
