Exemplar-based Stylized Gesture Generation from Speech: An Entry to the GENEA Challenge 2022

We present our entry to the GENEA Challenge of 2022 on data-driven co-speech gesture generation. Our system is a neural network that generates gesture animation from an input audio file. The motion style produced by the model is extracted from an exemplar motion clip, and style is embedded in a latent space using a variational framework. This architecture allows the model to generate gestures in styles unseen during training. Moreover, the probabilistic nature of our variational framework enables the generation of a variety of outputs given the same input, addressing the stochastic nature of gesture motion. The GENEA Challenge evaluation showed that our model produces full-body motion with highly competitive levels of human-likeness.
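The variety-from-one-input property described above follows from sampling the style latent rather than using a fixed embedding. As a minimal illustrative sketch (not the paper's actual architecture: the 16-dimensional latent and the encoder outputs below are invented for illustration), the reparameterization trick used in variational frameworks can be written as:

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample a style latent z = mu + sigma * eps.

    The same exemplar clip (same mu, logvar) yields different draws
    of z, so downstream gesture generation varies while staying close
    to the exemplar's style."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
# Hypothetical encoder output for one exemplar motion clip:
mu = np.zeros(16)            # 16-D style latent mean (illustrative size)
logvar = np.full(16, -2.0)   # small variance => draws cluster near mu

z1 = reparameterize(mu, logvar, rng)
z2 = reparameterize(mu, logvar, rng)
# z1 and z2 differ (stochastic outputs) but both remain near mu
# (shared style), which is the behavior the abstract describes.
```

Conditioning the gesture decoder on such samples is one standard way a variational model produces multiple plausible motions for a single audio input.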
