Transflower: probabilistic autoregressive dance generation with multimodal attention

Dance requires skillful composition of complex movements that follow the rhythmic, tonal, and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as the problem of modelling a high-dimensional continuous motion signal conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as music context, using a multimodal transformer encoder. Second, we introduce the largest 3D dance-motion dataset to date, obtained with a variety of motion-capture technologies and including both professional and casual dancers. Using this dataset, we compare our new model against two baselines via objective metrics and a user study, and show that both the ability to model a probability distribution over motion and the ability to attend over a large motion and music context are necessary to produce interesting, diverse, and realistic dance that matches the music.
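The architecture described above can be summarized in a minimal sketch: a transformer encodes the concatenated motion and music tokens into a context representation, which then conditions a normalizing flow over the next pose. The code below is an illustrative approximation in PyTorch, not the paper's actual model: the class names (`AffineCoupling`, `DanceFlow`), the feature dimensions, the mean-pooled context vector, and the choice of simple affine couplings with a flip permutation are all assumptions made for brevity.

```python
# Sketch of the Transflower idea: a transformer encodes the multimodal
# (motion + music) context, and a conditional normalizing flow models the
# distribution over the next pose. All names and dimensions are hypothetical.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer conditioned on a context vector."""
    def __init__(self, pose_dim, ctx_dim, hidden=256):
        super().__init__()
        self.half = pose_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (pose_dim - self.half)),
        )

    def forward(self, x, ctx):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([x1, ctx], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)  # keep scales well-behaved
        return torch.cat([x1, x2 * torch.exp(log_s) + t], dim=-1), log_s.sum(dim=-1)

    def inverse(self, y, ctx):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(torch.cat([y1, ctx], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=-1)

class DanceFlow(nn.Module):
    """Transformer context encoder + conditional flow over the next pose."""
    def __init__(self, pose_dim=66, music_dim=80, d_model=128, n_layers=4):
        super().__init__()
        self.pose_dim = pose_dim
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.flow = nn.ModuleList(
            [AffineCoupling(pose_dim, d_model) for _ in range(n_layers)]
        )

    def context(self, past_poses, music):
        # Concatenate projected pose and music tokens into one multimodal
        # sequence, then pool the encoder output into a single context vector.
        tokens = torch.cat(
            [self.pose_proj(past_poses), self.music_proj(music)], dim=1
        )
        return self.encoder(tokens).mean(dim=1)

    def log_prob(self, next_pose, past_poses, music):
        ctx = self.context(past_poses, music)
        z, log_det = next_pose, 0.0
        for layer in self.flow:
            z, ld = layer(z, ctx)
            log_det = log_det + ld
            z = z.flip(-1)  # cheap permutation so both halves get transformed
        base = torch.distributions.Normal(0.0, 1.0)
        return base.log_prob(z).sum(dim=-1) + log_det

    @torch.no_grad()
    def sample(self, past_poses, music):
        ctx = self.context(past_poses, music)
        z = torch.randn(past_poses.size(0), self.pose_dim)
        for layer in reversed(self.flow):
            z = layer.inverse(z.flip(-1), ctx)
        return z

# Training maximizes exact log-likelihood of the ground-truth next pose;
# autoregressive generation samples a pose, appends it, and repeats.
model = DanceFlow()
poses = torch.randn(2, 10, 66)  # batch of 10 past pose frames
music = torch.randn(2, 20, 80)  # batch of 20 music feature frames
loss = -model.log_prob(torch.randn(2, 66), poses, music).mean()
next_pose = model.sample(poses, music)
```

Samples drawn this way are stochastic by construction, which is what lets the model produce diverse dances for the same music rather than collapsing to an average motion.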
