Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment

How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenge of bridging the large information gap that often exists between sight and sound. We design a model that schedules the learning procedure of each component so as to associate the audio and visual modalities despite this gap. The key idea is to enrich the audio features with visual information by learning to align audio to a visual latent space. We translate the input audio into visual features, then use a pre-trained generator to produce an image. To further improve the quality of the generated images, we use sound source localization to select audio-visual pairs with strong cross-modal correlations. We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches, and we show that we can control the model's predictions by applying simple manipulations to the input waveform or to the latent space.
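The core idea of aligning audio features to a visual latent space can be illustrated with a toy sketch. The snippet below is purely illustrative and not the paper's implementation: it stands in for the learned audio-to-visual translator with a linear map fit by least squares on hypothetical paired embeddings (`A` for audio, `V` for visual latents), showing how aligned audio features end up in the space a pre-trained image generator consumes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: n paired audio/visual embeddings, where the
# visual latents are a noisy linear function of the audio features.
d_audio, d_visual, n = 16, 8, 200
W_true = rng.normal(size=(d_audio, d_visual))
A = rng.normal(size=(n, d_audio))                   # audio features
V = A @ W_true + 0.01 * rng.normal(size=(n, d_visual))  # visual latents

# Fit a linear audio-to-visual translator by least squares; in the
# paper this role is played by a trained mapping network.
W, *_ = np.linalg.lstsq(A, V, rcond=None)

# Translated audio features now live in the visual latent space and
# could be passed to a pre-trained generator to produce an image.
V_hat = A @ W
err = float(np.mean((V_hat - V) ** 2))
```

With enough paired data the recovered map closely matches the ground-truth one, which is the sense in which "alignment" enriches audio features with visual structure; the actual method learns a nonlinear mapping with a scheduled training procedure.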
