Any-to-Any Generation via Composable Diffusion
[1] Jerry Li, et al. Automatic Prompt Optimization with "Gradient Descent" and Beam Search, 2023, arXiv.
[2] Seung Wook Kim, et al. Align Your Latents: High-Resolution Video Synthesis with Latent Diffusion Models, 2023, CVPR.
[3] Sonal Gupta, et al. Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation, 2023, arXiv.
[4] Jing Liu, et al. VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset, 2023, arXiv.
[5] Marco Tulio Ribeiro, et al. Sparks of Artificial General Intelligence: Early experiments with GPT-4, 2023, arXiv.
[6] Patrick Esser, et al. Structure and Content-Guided Video Synthesis with Diffusion Models, 2023, ICCV.
[7] Jingren Zhou, et al. mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video, 2023, ICML.
[8] Jia-Bin Huang, et al. Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models, 2023, ICML.
[9] S. Savarese, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, 2023, arXiv.
[10] Wenwu Wang, et al. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models, 2023, ICML.
[11] Benjamin Elizalde, et al. CLAP: Learning Audio Concepts From Natural Language Supervision, 2023, ICASSP.
[12] Ting Yao, et al. Semantic-Conditional Diffusion Networks for Image Captioning, 2023, CVPR.
[13] G. Hua, et al. Exploring Discrete Diffusion Models for Image Captioning, 2022, arXiv.
[14] Humphrey Shi, et al. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, 2022, arXiv.
[15] Eungbeom Kim, et al. Improving Audio-Language Learning with MixGen and Multi-Level Test-Time Augmentation, 2022, arXiv.
[16] Ludwig Schmidt, et al. LAION-5B: An open large-scale dataset for training next generation image-text models, 2022, NeurIPS.
[17] David J. Fleet, et al. Imagen Video: High Definition Video Generation with Diffusion Models, 2022, arXiv.
[18] Yaniv Taigman, et al. Make-A-Video: Text-to-Video Generation without Text-Video Data, 2022, ICLR.
[19] Mohit Bansal, et al. TVLT: Textless Vision-Language Transformer, 2022, NeurIPS.
[20] Wendi Zheng, et al. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers, 2022, ICLR.
[21] Zhe Gan, et al. GIT: A Generative Image-to-text Transformer for Vision and Language, 2022, TMLR.
[22] N. Codella, et al. i-Code: An Integrative and Composable Multimodal Learning Framework, 2022, AAAI.
[23] Oriol Vinyals, et al. Flamingo: a Visual Language Model for Few-Shot Learning, 2022, NeurIPS.
[24] Prafulla Dhariwal, et al. Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022, arXiv.
[25] David J. Fleet, et al. Video Diffusion Models, 2022, NeurIPS.
[26] Yaniv Taigman, et al. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors, 2022, ECCV.
[27] Jingren Zhou, et al. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework, 2022, ICML.
[28] C. Schmid, et al. End-to-end Generative Pretraining for Multimodal Video Captioning, 2022, CVPR.
[29] Yejin Choi, et al. MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound, 2022, CVPR.
[30] B. Ommer, et al. High-Resolution Image Synthesis with Latent Diffusion Models, 2022, CVPR.
[31] Prafulla Dhariwal, et al. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models, 2021, ICML.
[32] Jian Liang, et al. NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion, 2021, ECCV.
[33] B. Guo, et al. Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions, 2022, CVPR.
[34] Ron Mokady, et al. ClipCap: CLIP Prefix for Image Captioning, 2021, arXiv.
[35] Ali Farhadi, et al. MERLOT: Multimodal Neural Script Knowledge Models, 2021, NeurIPS.
[36] Chang Zhou, et al. CogView: Mastering Text-to-Image Generation via Transformers, 2021, NeurIPS.
[37] Guillermo Sapiro, et al. GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions, 2021, arXiv.
[38] Andrew Zisserman, et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, 2021, ICCV.
[39] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[40] Jaemin Cho, et al. Unifying Vision-and-Language Tasks via Text Generation, 2021, ICML.
[41] Thomas Breuel, et al. ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning, 2021, ICCV.
[42] Abhishek Kumar, et al. Score-Based Generative Modeling through Stochastic Differential Equations, 2020, ICLR.
[43] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[44] Félix Gontier, et al. Automated Audio Captioning by Fine-Tuning BART with AudioSet Tags, 2021, DCASE.
[45] Jaehyeon Kim, et al. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis, 2020, NeurIPS.
[46] Pieter Abbeel, et al. Denoising Diffusion Probabilistic Models, 2020, NeurIPS.
[47] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[48] Xiujun Li, et al. Optimus: Organizing Sentences via Pre-trained Modeling of a Latent Space, 2020, EMNLP.
[49] Bing Li, et al. Object Relational Graph With Teacher-Recommended Learning for Video Captioning, 2020, CVPR.
[50] Gunhee Kim, et al. AudioCaps: Generating Captions for Audios in The Wild, 2019, NAACL.
[51] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[52] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[53] Aren Jansen, et al. Audio Set: An ontology and human-labeled dataset for audio events, 2017, ICASSP.
[54] Antonio Torralba, et al. SoundNet: Learning Sound Representations from Unlabeled Video, 2016, NIPS.
[55] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.
[56] Surya Ganguli, et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015, ICML.
[57] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[58] Pietro Perona, et al. Microsoft COCO: Common Objects in Context, 2014, ECCV.