The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion

In recent years, video generation has drawn significant attention as a prominent generative task. However, audio-to-video generation remains underexplored, even though audio carries unique qualities such as temporal semantics and magnitude. We therefore propose The Power of Sound (TPoS), a model that incorporates audio input with its changeable temporal semantics and magnitude. To generate video frames, TPoS uses a latent stable diffusion model conditioned on textual semantic information and guided by sequential audio embeddings from our pretrained Audio Encoder. As a result, the method produces audio-reactive video content. We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art audio-to-video generation techniques. More examples are available at https://ku-vai.github.io/TPoS/
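
To make the pipeline concrete, here is a minimal, hypothetical PyTorch sketch of the idea the abstract describes: a sequential audio encoder turns a mel-spectrogram into one embedding per video frame, and a latent diffusion denoiser samples each frame conditioned on its embedding. Every module name, shape, and the toy DDPM sampler are illustrative assumptions, not the authors' implementation (TPoS builds on a pretrained Stable Diffusion model and a trained Audio Encoder; text conditioning and the VAE decoder are omitted for brevity).

```python
# Hypothetical sketch of TPoS-style audio-reactive generation (not the
# authors' code). All module names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Maps a mel-spectrogram (batch, time, n_mels) to one embedding per
    video frame, so temporal semantics and magnitude can steer each frame."""
    def __init__(self, n_mels=128, dim=512):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, dim, batch_first=True)

    def forward(self, mel, num_frames):
        feats, _ = self.lstm(mel)                          # (B, T, dim)
        # Subsample the audio timeline into num_frames conditioning vectors.
        idx = torch.linspace(0, feats.size(1) - 1, num_frames).long()
        return feats[:, idx]                               # (B, F, dim)

class TinyLatentDenoiser(nn.Module):
    """Stand-in for the latent-diffusion U-Net: predicts noise in the
    latent given the timestep and an audio-frame embedding."""
    def __init__(self, latent_dim=4 * 32 * 32, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, z, t, cond):
        x = torch.cat([z.flatten(1), cond, t[:, None].float()], dim=1)
        return self.net(x).view_as(z)

@torch.no_grad()
def generate(denoiser, audio_embs, steps=50):
    """Toy DDPM sampling loop: one latent per video frame, each frame
    conditioned on its own audio embedding."""
    B, F, _ = audio_embs.shape
    betas = torch.linspace(1e-4, 2e-2, steps)
    alphas = torch.cumprod(1.0 - betas, dim=0)             # cumulative products
    frames = []
    for f in range(F):
        z = torch.randn(B, 4, 32, 32)                      # initial latent noise
        for t in reversed(range(steps)):
            eps = denoiser(z, torch.full((B,), t), audio_embs[:, f])
            a, b = alphas[t], betas[t]
            z = (z - b / (1 - a).sqrt() * eps) / (1 - b).sqrt()
            if t > 0:
                z = z + b.sqrt() * torch.randn_like(z)
        frames.append(z)                                   # decode with a VAE in practice
    return torch.stack(frames, dim=1)                      # (B, F, 4, 32, 32)

encoder = AudioEncoder()
denoiser = TinyLatentDenoiser()
mel = torch.randn(1, 400, 128)                             # illustrative ~4 s clip
video_latents = generate(denoiser, encoder(mel, num_frames=8))
print(video_latents.shape)                                 # torch.Size([1, 8, 4, 32, 32])
```

Conditioning each frame on its own audio embedding is what makes the output audio-reactive: changes in the sound's magnitude or semantics over time alter the per-frame conditioning signal and hence the generated frames.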
