Paper Information - AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

In recent years, image generation has made a great leap in performance, with diffusion models playing a central role. Although they generate high-quality images, such models are mainly conditioned on textual descriptions. This raises the question: how can we adapt such models to be conditioned on other modalities? In this paper, we propose a novel method that uses latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations. Such a modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest the proposed method is superior to the evaluated baseline methods on both objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
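
The idea of encoding audio into a single token that sits between the audio and text representations can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the adapter architecture, so the mean pooling, the single linear projection, the placeholder position, and all names and dimensions below (`mean_pool`, `project`, `insert_audio_token`, `AUDIO_DIM`, `TEXT_DIM`) are assumptions made only for illustration. Random numbers stand in for the outputs of the frozen audio encoder and the frozen text embedding table.

```python
import random


def make_linear(in_dim, out_dim, seed=0):
    """Hypothetical trainable adapter: an out_dim x in_dim weight matrix plus a bias."""
    rng = random.Random(seed)
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def mean_pool(frames):
    """Average frame-level audio features over time into one vector (an assumed pooling choice)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def project(vector, weight, bias):
    """Map the pooled audio vector into the text-embedding dimension."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weight, bias)]


def insert_audio_token(prompt_embeddings, audio_token, placeholder_index):
    """Return a copy of the prompt embeddings with one slot replaced by the audio token."""
    conditioned = [list(embedding) for embedding in prompt_embeddings]
    conditioned[placeholder_index] = list(audio_token)
    return conditioned


# Toy dimensions, chosen only for illustration.
AUDIO_DIM, TEXT_DIM = 6, 4

rng = random.Random(1)
# Stand-in for frame-level features from a frozen, pre-trained audio encoder.
audio_frames = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(10)]
# Stand-in for the frozen text embeddings of a 5-token prompt with one placeholder slot.
prompt_embeddings = [[0.0] * TEXT_DIM for _ in range(5)]

weight, bias = make_linear(AUDIO_DIM, TEXT_DIM)
audio_token = project(mean_pool(audio_frames), weight, bias)
conditioned = insert_audio_token(prompt_embeddings, audio_token, placeholder_index=3)

# In this sketch only the adapter would be trained; both large models stay frozen.
trainable_params = AUDIO_DIM * TEXT_DIM + TEXT_DIM
```

In this sketch, `conditioned` would be passed to the frozen text-conditioned generator in place of the ordinary prompt embeddings. The trainable parameter count is just the size of the small projection, which is the sense in which such an approach is lightweight to optimize.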
