SGToolkit: An Interactive Gesture Authoring Toolkit for Embodied Conversational Agents

Non-verbal behavior is essential for embodied agents such as social robots, virtual avatars, and digital humans. Existing behavior authoring approaches, including keyframe animation and motion capture, are too expensive to use when there are numerous utterances requiring gestures. Automatic generation methods show promising results, but their output quality is not yet satisfactory, and it is difficult to modify the outputs as a gesture designer intends. We introduce a new gesture generation toolkit, named SGToolkit, which produces higher-quality output than automatic methods and is more efficient than manual authoring. For the toolkit, we propose a neural generative model that synthesizes gestures from speech and accommodates fine-level pose controls and coarse-level style controls from users. A user study with 24 participants showed that the toolkit was favored over manual authoring, and the generated gestures were also human-like and appropriate to the input speech. SGToolkit is platform-agnostic, and the code is available at https://github.com/ai4r/SGToolkit.
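
To make the control mechanism concrete, below is a minimal sketch of a speech-conditioned gesture generator that also accepts per-frame pose controls (fine level) and a sequence-level style vector (coarse level). It is an illustration under stated assumptions, not the released SGToolkit implementation: the module names, feature dimensions, and the style parameterization are hypothetical, and the masks marking which frames a designer constrained are one plausible way to feed sparse controls into a sequence model.

```python
# Illustrative sketch only; shapes, names, and the control encoding are assumptions.
import torch
import torch.nn as nn

class ControllableGestureGenerator(nn.Module):
    """Generates a pose sequence from speech features, optionally steered by
    fine-level pose controls and coarse-level style controls."""

    def __init__(self, speech_dim=428, pose_dim=27, style_dim=3, hidden=256):
        super().__init__()
        # Bidirectional encoder over the speech context (audio + text features).
        self.speech_enc = nn.GRU(speech_dim, hidden, batch_first=True,
                                 bidirectional=True)
        # Fine-level control: a target pose per frame plus a mask bit that
        # marks which frames the designer actually constrained.
        self.pose_ctrl = nn.Linear(pose_dim + 1, hidden)
        # Coarse-level control: a small style vector (e.g., speed, spatial
        # extent, handedness) plus a mask bit, broadcast over time.
        self.style_ctrl = nn.Linear(style_dim + 1, hidden)
        # Decoder fuses speech and control features into output poses.
        self.decoder = nn.GRU(2 * hidden + hidden + hidden, hidden,
                              batch_first=True)
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, speech, pose_ctrl, pose_mask, style_ctrl, style_mask):
        # speech:     (B, T, speech_dim)
        # pose_ctrl:  (B, T, pose_dim), zeros on unconstrained frames
        # pose_mask:  (B, T, 1), 1 where a pose control is given
        # style_ctrl: (B, T, style_dim), style_mask: (B, T, 1)
        h, _ = self.speech_enc(speech)
        p = self.pose_ctrl(torch.cat([pose_ctrl * pose_mask, pose_mask], dim=-1))
        s = self.style_ctrl(torch.cat([style_ctrl * style_mask, style_mask], dim=-1))
        fused, _ = self.decoder(torch.cat([h, p, s], dim=-1))
        return self.out(fused)  # (B, T, pose_dim) pose sequence

# Usage example: a 4-second clip at 15 fps with one pinned keyframe at frame 30.
B, T = 1, 60
model = ControllableGestureGenerator()
speech = torch.randn(B, T, 428)
pose_ctrl = torch.zeros(B, T, 27); pose_mask = torch.zeros(B, T, 1)
pose_ctrl[:, 30] = torch.randn(27); pose_mask[:, 30] = 1.0
style_ctrl = torch.zeros(B, T, 3); style_mask = torch.zeros(B, T, 1)
gestures = model(speech, pose_ctrl, pose_mask, style_ctrl, style_mask)
print(gestures.shape)  # torch.Size([1, 60, 27])
```

In this sketch, unconstrained frames carry zero controls and a zero mask, so the generator falls back to purely speech-driven synthesis there, while constrained frames pull the output toward the designer-specified pose or style.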
