Robots Learn Social Skills: End-to-End Learning of Co-Speech Gesture Generation for Humanoid Robots

Co-speech gestures enhance interaction experiences between humans as well as between humans and robots. Most existing robots rely on rule-based speech-gesture associations, which require substantial human labor and expert knowledge to implement. We present a learning-based co-speech gesture generation method trained end to end on 52 hours of TED talks. The proposed neural network model consists of an encoder that understands speech text and a decoder that generates a sequence of gestures. The model successfully produces various gesture types, including iconic, metaphoric, deictic, and beat gestures. In a subjective evaluation, participants reported that the generated gestures were human-like and matched the speech content. We also demonstrate co-speech gesture generation running in real time on a NAO robot.
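
The abstract describes an encoder-decoder model that maps speech text to a gesture sequence. Below is a minimal sketch of such an architecture in PyTorch; the GRU cells, layer sizes, vocabulary size, and pose dimensionality are assumptions made for illustration and do not reflect the authors' exact configuration.

# Minimal encoder-decoder sketch (assumed GRU-based seq2seq; sizes are illustrative).
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, hidden_dim=256):
        super().__init__()
        # Word embeddings (could be initialized from pretrained vectors such as GloVe).
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, word_ids):
        embedded = self.embedding(word_ids)        # (batch, words, emb_dim)
        outputs, hidden = self.gru(embedded)       # summarize the speech text
        return outputs, hidden

class GestureDecoder(nn.Module):
    def __init__(self, pose_dim=10, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, pose_dim)  # one pose vector per output frame

    def forward(self, prev_poses, hidden):
        outputs, hidden = self.gru(prev_poses, hidden)
        poses = self.out(outputs)                   # (batch, frames, pose_dim)
        return poses, hidden

# Usage: encode a tokenized transcript, then decode a sequence of upper-body poses.
encoder = TextEncoder(vocab_size=20000)
decoder = GestureDecoder(pose_dim=10)
words = torch.randint(0, 20000, (1, 12))            # one 12-word utterance (dummy ids)
_, hidden = encoder(words)
seed = torch.zeros(1, 30, 10)                        # 30 frames of seed poses
poses, _ = decoder(seed, hidden)                     # predicted gesture motion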
