Toward a Rule-Based Synthesis of Emotional Speech on Linguistic Descriptions of Perception

This paper reports rules for morphing a voice so that it is perceived as containing various primitive features, for example, so that it sounds more “bright” or “dark”. In previous work, we proposed a three-layered model for the perception of emotional speech, whose layers are emotional speech, primitive features, and acoustic features. Through perceptual experiments and acoustic analysis, we established the relationships between the three layers and reported that these relationships are significant. A bottom-up method was then adopted to verify them: we morphed (resynthesized) a voice by composing acoustic features in the bottommost layer so that listeners could perceive one or more primitive features in it, which in turn could be perceived as different categories of emotion. Intermediate results show that the relationships in the model built in our previous work are valid.
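
As a concrete illustration of this bottom-up step, the sketch below resynthesizes an utterance after applying two hypothetical morphing rules, raising the F0 contour and boosting high-frequency energy, the kind of acoustic-feature composition one might use to make a voice sound “brighter”. It is a minimal sketch only: it assumes the freely available WORLD vocoder (via the `pyworld` package) in place of a STRAIGHT-style analysis-synthesis system, and the function `morph_brighter` and its rule values are invented for illustration, not taken from the paper.

```python
# A minimal illustrative sketch, not the paper's actual rules.
import numpy as np
import pyworld as pw          # WORLD vocoder, a stand-in for STRAIGHT
import soundfile as sf

def morph_brighter(path_in, path_out, f0_ratio=1.2, brightness_db=3.0):
    """Resynthesize a (mono) utterance with a raised F0 contour and a
    high-frequency boost, two hypothetical rules for a "brighter" voice."""
    x, fs = sf.read(path_in)
    x = x.astype(np.float64)  # WORLD expects float64 samples

    # Decompose into F0 contour, spectral envelope, and aperiodicity.
    f0, sp, ap = pw.wav2world(x, fs)

    # Rule 1: scale the F0 contour upward (unvoiced frames stay at 0).
    f0_mod = f0 * f0_ratio

    # Rule 2: boost high frequencies; the gain grows logarithmically with
    # frequency, reaching brightness_db at 1 kHz (illustrative values only).
    freqs = np.linspace(0.0, fs / 2.0, sp.shape[1])
    gain_db = brightness_db * np.log2(1.0 + freqs / 1000.0)
    sp_mod = sp * 10.0 ** (gain_db / 10.0)  # sp is a power spectrum

    # Resynthesize from the modified acoustic features.
    y = pw.synthesize(f0_mod, sp_mod, ap, fs)
    sf.write(path_out, y, fs)

morph_brighter("neutral.wav", "brighter.wav")
```

Working on the vocoder's F0/envelope/aperiodicity decomposition keeps each rule independent, so single or multiple primitive features can be composed by chaining such modifications before a single resynthesis pass.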
