A corpus-based speech synthesis system with emotion

We propose a new approach to synthesizing emotional speech with a corpus-based concatenative speech synthesis system (ATR CHATR) using corpora of emotional speech. In this study, neither emotion-dependent prosody prediction nor signal processing per se is performed on the synthesized speech. Instead, a large speech corpus is created for each emotion, and speech with the appropriate emotion is synthesized simply by switching between the emotional corpora. This is made possible by the normalization procedure incorporated in CHATR, which transforms its standard predicted prosody range according to the source database in use. We evaluate the approach by creating three emotional speech corpora (anger, joy, and sadness) from recordings of a male and a female speaker of Japanese. The acoustic characteristics of the corpora differ from one another, and the emotions are identifiable. The acoustic characteristics of each emotional utterance synthesized by our method correlate clearly with those of the corresponding corpus. Perceptual experiments using synthesized speech confirmed that the method can synthesize recognizably emotional speech. We further evaluated the method's intelligibility and the overall impression it gives to listeners. The results show that the proposed method synthesizes speech with high intelligibility and leaves a favorable impression. With these encouraging results, we have developed a workable text-to-speech system with emotion to support the immediate needs of nonspeaking individuals. This paper describes the proposed method, the design and acoustic characteristics of the corpora, and the results of the perceptual evaluations.
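The core idea above — selecting emotion by switching source corpora, with the synthesizer's standard predicted prosody remapped onto the range of the corpus in use — can be sketched in miniature. This is a hypothetical illustration, not the CHATR implementation: the class names, the F0 statistics, and the z-score mapping are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EmotionalCorpus:
    """A per-emotion unit database, summarized here only by its F0 statistics (Hz)."""
    name: str
    f0_mean: float
    f0_std: float


def normalize_f0(contour: List[float], std_mean: float, std_std: float,
                 corpus: EmotionalCorpus) -> List[float]:
    """Map a contour predicted in the synthesizer's standard prosody range
    onto the range of the selected emotional corpus (z-score remapping)."""
    return [corpus.f0_mean + corpus.f0_std * (f - std_mean) / std_std
            for f in contour]


# "Switching" emotions amounts to selecting a different source corpus;
# the statistics below are invented for illustration only.
corpora: Dict[str, EmotionalCorpus] = {
    "anger":   EmotionalCorpus("anger",   180.0, 45.0),
    "joy":     EmotionalCorpus("joy",     200.0, 40.0),
    "sadness": EmotionalCorpus("sadness", 120.0, 15.0),
}

predicted = [130.0, 150.0, 140.0]   # standard predicted F0 contour (Hz)
std_mean, std_std = 140.0, 10.0     # synthesizer's standard F0 statistics

angry = normalize_f0(predicted, std_mean, std_std, corpora["anger"])
sad = normalize_f0(predicted, std_mean, std_std, corpora["sadness"])
```

The same predicted contour thus lands in a high, wide range for anger and a low, narrow range for sadness, without any emotion-specific prosody prediction or signal processing.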
