Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots

Co-speech gestures enhance interaction experiences between humans as well as between humans and robots. Most existing robots rely on rule-based speech-gesture associations, which require substantial human labor and expert knowledge to implement. We present a learning-based co-speech gesture generation method trained end to end on 52 hours of TED talks. The proposed neural network model consists of an encoder that understands speech text and a decoder that generates a sequence of gestures. The model successfully produces various gesture types, including iconic, metaphoric, deictic, and beat gestures. In a subjective evaluation, participants reported that the generated gestures were human-like and matched the speech content. We also demonstrate co-speech gesture generation running in real time on a NAO robot.
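
The abstract describes an encoder-decoder model that maps speech text to a gesture sequence. Below is a minimal sketch of such an architecture in PyTorch; the GRU cells, layer sizes, vocabulary size, and pose dimensionality are assumptions made for illustration and do not reflect the authors' exact configuration.

# Minimal encoder-decoder sketch (assumed GRU-based seq2seq; sizes are illustrative).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256):
        super().__init__()
        # Word embeddings (could be initialized from pretrained vectors such as GloVe).
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids):
        embedded = self.embedding(word_ids)        # (batch, words, emb_dim)
        outputs, hidden = self.gru(embedded)       # summarize the speech text
        return outputs, hidden

class GestureDecoder(nn.Module):
    def __init__(self, pose_dim=10, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)  # one pose vector per output frame

    def forward(self, prev_poses, hidden):
        outputs, hidden = self.gru(prev_poses, hidden)
        poses = self.out(outputs)                   # (batch, frames, pose_dim)
        return poses, hidden

# Usage: encode a tokenized transcript, then decode a sequence of upper-body poses.
encoder = TextEncoder(vocab_size=20000)
decoder = GestureDecoder(pose_dim=10)
words = torch.randint(0, 20000, (1, 12))            # one 12-word utterance (dummy ids)
_, hidden = encoder(words)
seed = torch.zeros(1, 30, 10)                        # 30 frames of seed poses
poses, _ = decoder(seed, hidden)                     # predicted gesture motion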
