Modeling prosody patterns for Chinese expressive text-to-speech synthesis

This paper proposes an approach for modeling the prosody patterns of acoustic features for Chinese expressive text-to-speech (TTS) synthesis. Based on the observation that a speaker usually tends to place more emphasis on one particular syllable within a multi-syllabic prosodic word, we identify such a syllable as the core syllable, which can be derived from the semantic stress and tone information of the text prompt. We then classify the syllables in speech into four classes, based on their relations to the core syllable within a prosodic word. We analyze contrastive (neutral versus expressive) speech recordings for each of the four classes, and develop a perturbation model that takes the prosody pattern into account to transform neutral speech into expressive speech. Perceptual experiments on both neutral speech recordings and neutral TTS outputs, involving 13 subjects, indicate that the proposed approach can significantly enhance the expressivity of synthesized speech.
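The perturbation model described above can be pictured as a class-dependent transformation of per-syllable prosodic features. The sketch below is purely illustrative and assumes simplified inputs (per-syllable mean F0 in Hz and duration in seconds); the class labels and scaling factors are hypothetical placeholders, not the values learned in the paper.

```python
# Illustrative sketch of a class-dependent prosody perturbation.
# The four classes (relative to the core syllable of a prosodic word)
# and the scaling factors below are hypothetical, not the paper's values.

CLASS_SCALES = {
    "core":      {"f0": 1.15, "dur": 1.20},  # emphasized core syllable
    "pre_core":  {"f0": 1.05, "dur": 1.00},  # syllables before the core
    "post_core": {"f0": 0.95, "dur": 0.95},  # syllables after the core
    "neutral":   {"f0": 0.90, "dur": 0.90},  # unstressed / neutral-tone
}

def perturb(syllables):
    """Scale neutral prosody features toward expressive targets.

    syllables: list of dicts with keys 'label', 'f0' (Hz), 'dur' (s).
    Returns a new list with class-dependent scaled features.
    """
    out = []
    for s in syllables:
        scale = CLASS_SCALES[s["label"]]
        out.append({
            "label": s["label"],
            "f0": s["f0"] * scale["f0"],
            "dur": s["dur"] * scale["dur"],
        })
    return out
```

A real system would learn these transformations from the contrastive neutral/expressive recordings and apply them to richer contours than a single mean-F0 value, but the class-conditioned structure is the point of the sketch.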
