DiffusionRet: Generative Text-Video Retrieval with Diffusion Model
Hao Li | Kehan Li | Li-ming Yuan | Ze-Long Cheng | Peng Jin | Xiang Ji | Chang Liu | Jie Chen
[1] Chang Liu, et al. Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Chang Liu, et al. Parallel Vertex Diffusion for Unified Visual Grounding, 2023, AAAI.
[3] Ying Shan, et al. Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval, 2023, ArXiv.
[4] Yinhuai Wang, et al. Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model, 2022, ICLR.
[5] P. Luo, et al. DiffusionDet: Diffusion Model for Object Detection, 2022, ArXiv.
[6] Lingpeng Kong, et al. DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models, 2022, ICLR.
[7] Wenguan Wang, et al. GMMSeg: Gaussian Mixture based Generative Semantic Segmentation Models, 2022, NeurIPS.
[8] Amit H. Bermano, et al. Human Motion Diffusion Model, 2022, ICLR.
[9] Hao Li, et al. Toward 3D Spatial Reasoning for Human-like Text-based Visual Question Answering, 2022, ArXiv.
[10] Jonathan Ho. Classifier-Free Diffusion Guidance, 2022, ArXiv.
[11] Jie Chen, et al. Locality Guidance for Improving Vision Transformers on Tiny Datasets, 2022, ECCV.
[12] Luhui Xu, et al. TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval, 2022, ECCV.
[13] Ming Yan, et al. X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval, 2022, ACM Multimedia.
[14] Yi Ren, et al. ProDiff: Progressive Fast Diffusion Model for High-Quality Text-to-Speech, 2022, ACM Multimedia.
[15] Emmanuel Asiedu Brempong, et al. Denoising Pretraining for Semantic Segmentation, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[16] Xiang Lisa Li, et al. Diffusion-LM Improves Controllable Text Generation, 2022, NeurIPS.
[17] Hao Li, et al. Joint Learning of Object Graph and Relation Graph for Visual Question Answering, 2022, 2022 IEEE International Conference on Multimedia and Expo (ICME).
[18] Prafulla Dhariwal, et al. Hierarchical Text-Conditional Image Generation with CLIP Latents, 2022, ArXiv.
[19] Animesh Garg, et al. X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] Xiansheng Hua, et al. Disentangled Representation Learning for Text-Video Retrieval, 2022, ArXiv.
[21] S. Hoi, et al. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, 2022, ICML.
[22] Samuel Albanie, et al. Cross Modal Retrieval with Querybank Normalisation, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] A. Voynov, et al. Label-Efficient Semantic Segmentation with Diffusion Models, 2021, ICLR.
[24] Nan Duan, et al. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, 2021, Neurocomputing.
[25] R. Yu, et al. Dynamic Clustering Network for Unsupervised Semantic Segmentation, 2022, ArXiv.
[26] Lior Wolf, et al. SegDiff: Image Segmentation with Diffusion Probabilistic Models, 2021, ArXiv.
[27] Rianne van den Berg, et al. Structured Denoising Diffusion Models in Discrete State-Spaces, 2021, NeurIPS.
[28] Pengfei Xiong, et al. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP, 2021, ArXiv.
[29] Tasnima Sadekova, et al. Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech, 2021, ICML.
[30] Prafulla Dhariwal, et al. Diffusion Models Beat GANs on Image Synthesis, 2021, NeurIPS.
[31] Linchao Zhu, et al. T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Hailin Jin, et al. TeachText: CrossModal Generalized Distillation for Text-Video Retrieval, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[33] Andrew Zisserman, et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[34] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[35] Zhe Gan, et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[36] S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2020, ICLR.
[37] Florian Metze, et al. Support-set bottlenecks for video-text representation learning, 2020, ICLR.
[38] Jiaming Song, et al. Denoising Diffusion Implicit Models, 2020, ICLR.
[39] Jian Wu, et al. Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment, 2021, IJCAI.
[40] Chen Sun, et al. Multi-modal Transformer for Video Retrieval, 2020, ECCV.
[41] Pieter Abbeel, et al. Denoising Diffusion Probabilistic Models, 2020, NeurIPS.
[42] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[43] Ilya Sutskever, et al. Jukebox: A Generative Model for Music, 2020, ArXiv.
[44] Shizhe Chen, et al. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.
[46] Ross B. Girshick, et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[47] Yang Liu, et al. Use What You Have: Video retrieval using representations from collaborative experts, 2019, BMVC.
[48] Ivan Laptev, et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[49] Ali Razavi, et al. Generating Diverse High-Fidelity Images with VQ-VAE-2, 2019, NeurIPS.
[50] Jeff Donahue, et al. Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018, ICLR.
[51] Gunhee Kim, et al. A Joint Sequence Fusion Model for Video Question Answering and Retrieval, 2018, ECCV.
[52] Oriol Vinyals, et al. Representation Learning with Contrastive Predictive Coding, 2018, ArXiv.
[53] Alec Radford, et al. Improving Language Understanding by Generative Pre-Training, 2018.
[54] Trevor Darrell, et al. Localizing Moments in Video with Natural Language, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[55] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[56] Juan Carlos Niebles, et al. Dense-Captioning Events in Videos, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[57] Heiga Zen, et al. WaveNet: A Generative Model for Raw Audio, 2016, SSW.
[58] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Surya Ganguli, et al. Deep Unsupervised Learning using Nonequilibrium Thermodynamics, 2015, ICML.
[60] Bernt Schiele, et al. A dataset for Movie Description, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[61] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[62] Max Welling, et al. Auto-Encoding Variational Bayes, 2013, ICLR.
[63] Yoshua Bengio, et al. Deep Sparse Rectifier Neural Networks, 2011, AISTATS.
[64] Evgueni A. Haroutunian, et al. Information Theory and Statistics, 2011, International Encyclopedia of Statistical Science.
[65] A. P. Dawid, et al. Generative or Discriminative? Getting the Best of Both Worlds, 2007.
[66] Michael I. Jordan, et al. Distance Metric Learning with Application to Clustering with Side-Information, 2002, NIPS.
[67] Brendan J. Frey, et al. Does the Wake-sleep Algorithm Produce Good Density Estimators?, 1995, NIPS.