Self-supervised Dance Video Synthesis Conditioned on Music

We present a self-supervised approach with a pose perceptual loss for automatic dance video generation. Our method produces realistic dance videos that conform to the beats and rhythms of the given music. To achieve this, we first generate a human skeleton sequence from the music and then apply a learned pose-to-appearance mapping to render the final video. In the skeleton-generation stage, we use two discriminators to capture different aspects of the sequence and propose a novel pose perceptual loss that encourages natural dances. In addition, we introduce a new cross-modal evaluation metric that estimates the similarity between the two modalities (music and dance) to assess dance quality. Qualitative and quantitative experiments demonstrate that our dance video synthesis approach produces realistic and diverse results. Our source code and data are available at https://github.com/xrenaa/Music-Dance-Video-Synthesis.
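
As a rough illustration of the pose perceptual loss idea (a minimal sketch under stated assumptions, not the authors' released implementation): analogously to image perceptual losses computed with a fixed VGG, the distance between a generated skeleton sequence and the ground truth is measured in the feature space of a frozen, pretrained pose-recognition network. The `feature_net` below (e.g., an ST-GCN-style model assumed to return a list of per-layer features), the layer weights, and the (batch, frames, joints, 2) input layout are all hypothetical.

```python
import torch
import torch.nn as nn

class PosePerceptualLoss(nn.Module):
    """Sketch of a pose perceptual loss: compare generated and real
    skeleton sequences in the feature space of a frozen pose network
    (hypothetical `feature_net`, e.g. an ST-GCN-style model)."""

    def __init__(self, feature_net: nn.Module, layer_weights=(1.0, 1.0, 1.0)):
        super().__init__()
        self.feature_net = feature_net.eval()  # frozen feature extractor
        for p in self.feature_net.parameters():
            p.requires_grad_(False)
        self.layer_weights = layer_weights
        self.criterion = nn.L1Loss()

    def forward(self, fake_seq: torch.Tensor, real_seq: torch.Tensor):
        # fake_seq / real_seq: (batch, frames, joints, 2) 2D keypoints;
        # feature_net is assumed to return a list of per-layer features.
        fake_feats = self.feature_net(fake_seq)
        real_feats = self.feature_net(real_seq)
        loss = fake_seq.new_zeros(())
        for w, f, r in zip(self.layer_weights, fake_feats, real_feats):
            loss = loss + w * self.criterion(f, r.detach())
        return loss
```

Keeping the feature network frozen mirrors the fixed-VGG setup of image perceptual losses; because such a network is trained on skeleton data, matching its intermediate features pushes the generator toward natural-looking pose dynamics rather than merely minimizing per-joint coordinate error.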
