Style Transfer for Co-Speech Gesture Animation: A Multi-Speaker Conditional-Mixture Approach

How can we teach robots or virtual assistants to gesture naturally? Can we go further and adapt the gesturing style to follow a specific speaker? Gestures that are naturally timed with corresponding speech during human communication are called co-speech gestures. A key challenge, called gesture style transfer, is to learn a model that generates these gestures for a speaking agent 'A' in the gesturing style of a target speaker 'B'. A secondary goal is to simultaneously learn to generate co-speech gestures for multiple speakers while remembering what is unique about each speaker; we call this challenge style preservation. In this paper, we propose a new model, named Mix-StAGE, which trains a single model for multiple speakers while learning unique style embeddings for each speaker's gestures in an end-to-end manner. A novelty of Mix-StAGE is that it learns a mixture of generative models, which allows conditioning on each speaker's unique gesture style. Because Mix-StAGE disentangles the style and content of gestures, the gesturing style for the same input speech can be altered simply by switching the style embedding. Mix-StAGE also allows for style preservation when learning simultaneously from multiple speakers. We also introduce a new dataset, Pose-Audio-Transcript-Style (PATS), designed to study gesture generation and style transfer. Our proposed Mix-StAGE model significantly outperforms the previous state-of-the-art approach for gesture generation and provides a path towards performing gesture style transfer across multiple speakers. Link to code, data, and videos: this http URL
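To make the conditional-mixture idea concrete, here is a minimal PyTorch sketch of how such a model could be wired together: a shared content encoder maps speech features to a speaker-agnostic content code, each speaker owns a learned style embedding, and a small bank of pose decoders is mixed with style-conditioned weights. All module names, dimensions, and the audio featurization here are illustrative assumptions for exposition, not the authors' actual implementation.

```python
# Hypothetical sketch of a conditional-mixture gesture generator.
# Assumed, not from the paper: all layer choices, sizes, and names.
import torch
import torch.nn as nn

class ConditionalMixtureGestureModel(nn.Module):
    def __init__(self, n_speakers, audio_dim=128, content_dim=256,
                 style_dim=64, pose_dim=104, n_mixtures=8):
        super().__init__()
        # Shared, speaker-agnostic content encoder over audio features.
        self.content_encoder = nn.Sequential(
            nn.Conv1d(audio_dim, content_dim, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(content_dim, content_dim, kernel_size=5, padding=2),
        )
        # One learned style embedding per speaker.
        self.style = nn.Embedding(n_speakers, style_dim)
        # The style embedding decides how to mix the sub-generators.
        self.mixture_weights = nn.Linear(style_dim, n_mixtures)
        # A small bank of pose decoders (the "mixture of generative models").
        self.decoders = nn.ModuleList([
            nn.Conv1d(content_dim, pose_dim, kernel_size=1)
            for _ in range(n_mixtures)
        ])

    def forward(self, audio, speaker_id):
        # audio: (batch, audio_dim, time); speaker_id: (batch,)
        content = self.content_encoder(audio)                    # (B, C, T)
        style = self.style(speaker_id)                           # (B, S)
        w = torch.softmax(self.mixture_weights(style), dim=-1)   # (B, M)
        poses = torch.stack(
            [dec(content) for dec in self.decoders], dim=1)      # (B, M, P, T)
        # Style-weighted combination of the decoder bank.
        return torch.einsum('bm,bmpt->bpt', w, poses)            # (B, P, T)

# Style transfer: render the same audio with another speaker's embedding.
model = ConditionalMixtureGestureModel(n_speakers=25)
audio = torch.randn(1, 128, 64)
poses_as_speaker_a = model(audio, torch.tensor([3]))
poses_as_speaker_b = model(audio, torch.tensor([7]))  # swap style only
```

In this sketch, the style embedding influences only the mixture weights while the content code is computed from audio alone, so swapping `speaker_id` at inference time changes the gesturing style without altering the speech-driven content, which mirrors the disentanglement the abstract describes.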
