A Survey of Using Vocal Prosody to Convey Emotion in Robot Speech

Improvements in speech synthesis technology have made speech a practical channel for robots to communicate with their human users. Now that synthetic speech is intelligible enough for speech synthesizers to be widely accepted and used, what other aspects of speech synthesis can improve the quality of human–robot interaction? Communicating emotion through changes in vocal prosody is one way to make synthesized speech sound more natural. This article reviews the use of vocal prosody to convey emotion between humans, the use of vocal prosody by agents and avatars to convey emotion to their human users, and previous work within the human–robot interaction (HRI) community on vocal prosody in robot speech. The goals of this article are (1) to highlight the ability and importance of using vocal prosody to convey emotion in robot speech and (2) to identify experimental design issues that arise when using emotional robot speech in user studies.
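To make "changes in vocal prosody" concrete, the sketch below shows one common way such changes are specified in practice: mapping coarse emotion categories onto the pitch, rate, and volume attributes of the W3C SSML prosody element. This example is illustrative only; the emotion-to-prosody mapping and the specific parameter values are assumptions for demonstration, not settings drawn from or validated by the surveyed studies, and any real mapping would need perceptual validation of the kind the article discusses.

    # Minimal sketch: mapping coarse emotion categories onto SSML
    # prosody attributes for a TTS engine that supports the W3C
    # <prosody> element. The pitch/rate/volume values are illustrative
    # assumptions, not values taken from the surveyed literature.

    EMOTION_PROSODY = {
        # emotion: (relative pitch shift, speaking rate, volume)
        "joy":     ("+15%", "fast",   "loud"),
        "sadness": ("-10%", "slow",   "soft"),
        "anger":   ("+10%", "fast",   "x-loud"),
        "neutral": ("+0%",  "medium", "medium"),
    }

    def to_ssml(text: str, emotion: str = "neutral") -> str:
        """Wrap text in SSML prosody markup for the given emotion."""
        pitch, rate, volume = EMOTION_PROSODY[emotion]
        return (
            '<speak version="1.0" '
            'xmlns="http://www.w3.org/2001/10/synthesis">'
            f'<prosody pitch="{pitch}" rate="{rate}" volume="{volume}">'
            f"{text}</prosody></speak>"
        )

    if __name__ == "__main__":
        # Example: the same sentence rendered with "joyful" prosody.
        print(to_ssml("I found the object you asked for.", "joy"))

Because the markup only shifts global pitch, rate, and volume, it captures the dimensional (arousal-like) aspects of emotional prosody far more readily than categorical distinctions, a limitation that motivates the finer-grained prosody modeling reviewed in the article.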
