Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach

How can we teach robots or virtual assistants to gesture naturally? Can we go further and adapt the gesturing style to follow a specific speaker? Gestures that are naturally timed with corresponding speech during human communication are called co-speech gestures. A key challenge, called gesture style transfer, is to learn a model that generates these gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'. A secondary goal is to simultaneously learn to generate co-speech gestures for multiple speakers while remembering what is unique about each speaker; we call this challenge style preservation. In this paper, we propose a new model, named Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures in an end-to-end manner. A novelty of Mix-StAGE is that it learns a mixture of generative models, which allows conditioning on each speaker's unique gesture style. Because Mix-StAGE disentangles the style and content of gestures, the gesturing style for the same input speech can be altered simply by switching the style embedding. Mix-StAGE also allows for style preservation when learning simultaneously from multiple speakers. We also introduce a new dataset, Pose-Audio-Transcript-Style (PATS), designed to study gesture generation and style transfer. Our proposed Mix-StAGE model significantly outperforms the previous state-of-the-art approach for gesture generation and provides a path towards performing gesture style transfer across multiple speakers. Link to code, data, and videos: this http URL
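To make the conditional-mixture idea concrete, here is a minimal PyTorch sketch of how such a model could be wired together: a shared content encoder maps speech features to a speaker-agnostic content code, each speaker owns a learned style embedding, and a small bank of pose decoders is mixed with style-conditioned weights. All module names, dimensions, and the audio featurization here are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of a conditional-mixture gesture generator.
# Assumed, not from the paper: all layer choices, sizes, and names.
import torch
import torch.nn as nn

class ConditionalMixtureGestureModel(nn.Module):
    def __init__(self, n_speakers, audio_dim=128, content_dim=256,
                 style_dim=64, pose_dim=104, n_mixtures=8):
        super().__init__()
        # Shared, speaker-agnostic content encoder over audio features.
        self.content_encoder = nn.Sequential(
            nn.Conv1d(audio_dim, content_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(content_dim, content_dim, kernel_size=5, padding=2),
        )
        # One learned style embedding per speaker.
        self.style = nn.Embedding(n_speakers, style_dim)
        # The style embedding decides how to mix the sub-generators.
        self.mixture_weights = nn.Linear(style_dim, n_mixtures)
        # A small bank of pose decoders (the "mixture of generative models").
        self.decoders = nn.ModuleList([
            nn.Conv1d(content_dim, pose_dim, kernel_size=1)
            for _ in range(n_mixtures)
        ])

    def forward(self, audio, speaker_id):
        # audio: (batch, audio_dim, time); speaker_id: (batch,)
        content = self.content_encoder(audio)                    # (B, C, T)
        style = self.style(speaker_id)                           # (B, S)
        w = torch.softmax(self.mixture_weights(style), dim=-1)   # (B, M)
        poses = torch.stack(
            [dec(content) for dec in self.decoders], dim=1)      # (B, M, P, T)
        # Style-weighted combination of the decoder bank.
        return torch.einsum('bm,bmpt->bpt', w, poses)            # (B, P, T)

# Style transfer: render the same audio with another speaker's embedding.
model = ConditionalMixtureGestureModel(n_speakers=25)
audio = torch.randn(1, 128, 64)
poses_as_speaker_a = model(audio, torch.tensor([3]))
poses_as_speaker_b = model(audio, torch.tensor([7]))  # swap style only
```

In this sketch, the style embedding influences only the mixture weights while the content code is computed from audio alone, so swapping `speaker_id` at inference time changes the gesturing style without altering the speech-driven content, which mirrors the disentanglement the abstract describes.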
