The production and recognition of emotions in speech: features and algorithms

This paper presents algorithms that allow a robot to express its emotions by modulating the intonation of its voice. The algorithms are very simple and, thanks to concatenative speech synthesis, efficiently produce life-like speech. We describe a technique that allows continuous control of both the apparent age of a synthetic voice and the intensity of the emotions it expresses. We also present the first large-scale data-mining experiment on the automatic recognition of basic emotions in informal, everyday short utterances, focusing on the speaker-dependent problem. Using a database of several thousand examples, we compare a large set of machine learning algorithms, including neural networks, support vector machines, and decision trees, together with 200 features. We show that performance can differ substantially among learning schemes, and that some previously unexplored features are of crucial importance. An optimal feature set is derived through the use of a genetic algorithm. Finally, we explain how this study can be applied to real-world situations in which very few examples are available, and we describe a game, played with a personal robot, that facilitates the teaching of emotional utterances in a natural and rather unconstrained manner.
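The feature-selection step mentioned in the abstract (deriving an optimal subset of the 200 acoustic features with a genetic algorithm) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the fitness function below is a toy stand-in for classifier accuracy on held-out data, the feature count is shrunk to 20 for brevity, and all names and parameters (`pop_size`, `generations`, mutation rate) are assumptions.

```python
import random

random.seed(0)

N_FEATURES = 20  # the paper uses ~200 prosodic features; 20 keeps the demo small


def fitness(mask):
    # Toy stand-in for cross-validated classifier accuracy: pretend the
    # even-indexed features are informative, and penalise large subsets.
    useful = sum(bit for i, bit in enumerate(mask) if i % 2 == 0)
    return useful - 0.1 * sum(mask)


def crossover(a, b):
    # Single-point crossover of two binary feature masks.
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]


def mutate(mask, rate=0.05):
    # Flip each bit independently with a small probability.
    return [1 - bit if random.random() < rate else bit for bit in mask]


def ga_select(pop_size=30, generations=40):
    # Each individual is a binary mask: mask[i] == 1 keeps feature i.
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[:pop_size // 2]  # truncation selection
        children = [mutate(crossover(random.choice(parents),
                                     random.choice(parents)))
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return max(pop, key=fitness)


best = ga_select()
print(best)
```

In a real setting the fitness evaluation would train and score a classifier (e.g. one of the learning schemes compared in the paper) using only the features selected by the mask, which is what makes the search expensive and the GA's population-based exploration attractive.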
