AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation
Lior Wolf | Yossi Adi | Idan Schwartz | Itai Gat | Guy Yariv
[1] Kalyan Vasudev Alwala,et al. ImageBind: One Embedding Space To Bind Them All , 2023, ArXiv.
[2] Naman Goyal,et al. LLaMA: Open and Efficient Foundation Language Models , 2023, ArXiv.
[3] S. Savarese,et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , 2023, ArXiv.
[4] Jinyu Li,et al. Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers , 2023, ArXiv.
[5] W. Freeman,et al. Muse: Text-To-Image Generation via Masked Generative Transformers , 2023, ICML.
[6] Yossi Adi,et al. I Hear Your True Colors: Image Guided Audio Generation , 2022, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[7] Yossi Adi,et al. ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement , 2022, ArXiv.
[8] Yossi Adi,et al. Speaking Style Conversion With Discrete Self-Supervised Units , 2022, ArXiv.
[9] Alexander M. Rush,et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model , 2022, ArXiv.
[10] Yaniv Taigman,et al. Audio Language Modeling using Perceptually-Guided Discrete Representations , 2022, ArXiv.
[11] Yaniv Taigman,et al. AudioGen: Textually Guided Audio Generation , 2022, ICLR.
[12] Stan Z. Li,et al. A Survey on Generative Diffusion Model , 2022, ArXiv.
[13] Amit H. Bermano,et al. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion , 2022, ICLR.
[14] David J. Fleet,et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding , 2022, NeurIPS.
[15] Stella Rose Biderman,et al. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance , 2022, ECCV.
[16] Prafulla Dhariwal,et al. Hierarchical Text-Conditional Image Generation with CLIP Latents , 2022, ArXiv.
[17] Yaniv Taigman,et al. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors , 2022, ECCV.
[18] B. Ommer,et al. High-Resolution Image Synthesis with Latent Diffusion Models , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Prafulla Dhariwal,et al. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models , 2021, ICML.
[20] Lior Wolf,et al. ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Abdel-rahman Mohamed,et al. Textless Speech Emotion Conversion using Discrete & Decomposed Representations , 2021, EMNLP.
[22] J. Bello,et al. Wav2CLIP: Learning Robust Audio Representations from Clip , 2021, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[23] Jacek Mańdziuk,et al. Audio-to-Image Cross-Modal Generation , 2021, 2022 International Joint Conference on Neural Networks (IJCNN).
[24] Esa Rahtu,et al. Taming Visually Guided Sound Generation , 2021, BMVC.
[25] Prafulla Dhariwal,et al. Diffusion Models Beat GANs on Image Synthesis , 2021, NeurIPS.
[26] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[27] Prafulla Dhariwal,et al. Improved Denoising Diffusion Probabilistic Models , 2021, ICML.
[28] Emmanuel Dupoux,et al. On Generative Spoken Language Modeling from Raw Audio , 2021, Transactions of the Association for Computational Linguistics.
[29] Kristen Grauman,et al. VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] B. Ommer,et al. Taming Transformers for High-Resolution Image Synthesis , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[31] Bryan Catanzaro,et al. DiffWave: A Versatile Diffusion Model for Audio Synthesis , 2020, ICLR.
[32] Pieter Abbeel,et al. Denoising Diffusion Probabilistic Models , 2020, NeurIPS.
[33] Mark Chen,et al. Language Models are Few-Shot Learners , 2020, NeurIPS.
[34] Andrew Zisserman,et al. Vggsound: A Large-Scale Audio-Visual Dataset , 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[35] Tamir Hazan,et al. Factor Graph Attention , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Shun-Po Chuang,et al. Towards Audio to Scene Image Synthesis Using Generative Adversarial Network , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[37] Sanjeev Khudanpur,et al. X-Vectors: Robust DNN Embeddings for Speaker Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[38] Sepp Hochreiter,et al. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.
[39] Kevin Gimpel,et al. Gaussian Error Linear Units (GELUs) , 2016.