Transflower: probabilistic autoregressive dance generation with multimodal attention

Dance requires skillful composition of complex movements that follow the rhythmic, tonal, and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as the problem of modelling a high-dimensional continuous motion signal conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as music context, using a multimodal transformer encoder. Second, we introduce the largest 3D dance-motion dataset to date, obtained with a variety of motion-capture technologies and including both professional and casual dancers. Using this dataset, we compare our new model against two baselines via objective metrics and a user study, and show that both the ability to model a probability distribution over motion and the ability to attend over a large motion and music context are necessary to produce interesting, diverse, and realistic dance that matches the music.
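The architecture described above can be summarized in a minimal sketch: a transformer encodes the concatenated motion and music tokens into a context representation, which then conditions a normalizing flow over the next pose. The code below is an illustrative approximation in PyTorch, not the paper's actual model: the class names (`AffineCoupling`, `DanceFlow`), the feature dimensions, the mean-pooled context vector, and the choice of simple affine couplings with a flip permutation are all assumptions made for brevity.

```python
# Sketch of the Transflower idea: a transformer encodes the multimodal
# (motion + music) context, and a conditional normalizing flow models the
# distribution over the next pose. All names and dimensions are hypothetical.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One affine coupling layer conditioned on a context vector."""
    def __init__(self, pose_dim, ctx_dim, hidden=256):
        super().__init__()
        self.half = pose_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + ctx_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (pose_dim - self.half)),
        )

    def forward(self, x, ctx):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([x1, ctx], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)  # keep scales well-behaved
        return torch.cat([x1, x2 * torch.exp(log_s) + t], dim=-1), log_s.sum(dim=-1)

    def inverse(self, y, ctx):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        log_s, t = self.net(torch.cat([y1, ctx], dim=-1)).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        return torch.cat([y1, (y2 - t) * torch.exp(-log_s)], dim=-1)

class DanceFlow(nn.Module):
    """Transformer context encoder + conditional flow over the next pose."""
    def __init__(self, pose_dim=66, music_dim=80, d_model=128, n_layers=4):
        super().__init__()
        self.pose_dim = pose_dim
        self.pose_proj = nn.Linear(pose_dim, d_model)
        self.music_proj = nn.Linear(music_dim, d_model)
        enc = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=2)
        self.flow = nn.ModuleList(
            [AffineCoupling(pose_dim, d_model) for _ in range(n_layers)]
        )

    def context(self, past_poses, music):
        # Concatenate projected pose and music tokens into one multimodal
        # sequence, then pool the encoder output into a single context vector.
        tokens = torch.cat(
            [self.pose_proj(past_poses), self.music_proj(music)], dim=1
        )
        return self.encoder(tokens).mean(dim=1)

    def log_prob(self, next_pose, past_poses, music):
        ctx = self.context(past_poses, music)
        z, log_det = next_pose, 0.0
        for layer in self.flow:
            z, ld = layer(z, ctx)
            log_det = log_det + ld
            z = z.flip(-1)  # cheap permutation so both halves get transformed
        base = torch.distributions.Normal(0.0, 1.0)
        return base.log_prob(z).sum(dim=-1) + log_det

    @torch.no_grad()
    def sample(self, past_poses, music):
        ctx = self.context(past_poses, music)
        z = torch.randn(past_poses.size(0), self.pose_dim)
        for layer in reversed(self.flow):
            z = layer.inverse(z.flip(-1), ctx)
        return z

# Training maximizes exact log-likelihood of the ground-truth next pose;
# autoregressive generation samples a pose, appends it, and repeats.
model = DanceFlow()
poses = torch.randn(2, 10, 66)  # batch of 10 past pose frames
music = torch.randn(2, 20, 80)  # batch of 20 music feature frames
loss = -model.log_prob(torch.randn(2, 66), poses, music).mean()
next_pose = model.sample(poses, music)
```

Samples drawn this way are stochastic by construction, which is what lets the model produce diverse dances for the same music rather than collapsing to an average motion.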
