Conveying emotion in robotic speech: Lessons learned

This research explored whether robots can use modern speech synthesizers to convey emotion through their speech. We investigated the use of MARY, an open-source speech synthesizer, to convey a robot's emotional intent to novice robot users. In a first experiment, participants distinguished the intended emotions of anger, calm, fear, and sadness with success rates of 65.9%, 68.9%, 33.3%, and 49.2%, respectively. The recognition rate for the intended happiness statements, however, was 18.2%, below the 20% chance level for the five emotion categories. After the vocal prosody modifications for happiness were adjusted, its recognition rate improved to 30.3% in a second experiment. This is an important benchmarking step in a line of research investigating the use of emotional speech by robots to improve human-robot interaction. Recommendations and lessons learned from this research are presented.

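The paper does not include its synthesis code, but the prosody-modification approach lends itself to a short illustration. The sketch below (Python, standard library only) assumes a locally running MARY TTS 5.x server and its standard HTTP interface (the /process endpoint on port 59125, RAWMARYXML input); the voice name and the specific pitch and rate offsets are illustrative placeholders, not the settings evaluated in the experiments.

# Minimal sketch: request synthesized speech from a local MARY TTS 5.x
# server, using RAWMARYXML input so <prosody> attributes can shift
# pitch and speaking rate toward a "happy" rendering. Endpoint and
# parameter names follow MARY's standard HTTP interface; the prosody
# values below are illustrative only.
import urllib.parse
import urllib.request

MARY_URL = "http://localhost:59125/process"

maryxml = """<?xml version="1.0" encoding="UTF-8"?>
<maryxml version="0.5" xmlns="http://mary.dfki.de/2002/MaryXML"
         xml:lang="en-US">
  <prosody pitch="+30%" rate="+15%">
    I found the object you were looking for.
  </prosody>
</maryxml>"""

params = urllib.parse.urlencode({
    "INPUT_TEXT": maryxml,
    "INPUT_TYPE": "RAWMARYXML",
    "OUTPUT_TYPE": "AUDIO",
    "AUDIO": "WAVE_FILE",
    "LOCALE": "en_US",
    "VOICE": "cmu-slt-hsmm",  # a stock HMM voice shipped with MARY 5.x
}).encode()

# POST the request and save the returned WAV audio to disk.
with urllib.request.urlopen(MARY_URL, data=params) as resp:
    with open("happy_utterance.wav", "wb") as f:
        f.write(resp.read())

Raising pitch and speaking rate for happiness is consistent with the vocal-affect literature the study builds on, though the exact offsets that listeners recognize reliably would need to be tuned experimentally, as the two experiments here did.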