DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

Existing text-video retrieval solutions are, in essence, discriminative models that maximize the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we tackle the task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating the joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives: the generator is optimized with a generation loss, and the feature extractor is trained with a contrastive loss. In this way, DiffusionRet leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks (MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo) show superior performance and justify the efficacy of our method. More encouragingly, without any modification, DiffusionRet also performs well in out-of-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code will be available at https://github.com/jpthu17/DiffusionRet.
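
The distinction between p(candidates|query) and p(candidates,query) can be made concrete with a toy example. The sketch below is not the DiffusionRet method: it only uses the identity p(candidates,query) = p(candidates|query) p(query) on a hand-written table. The query names, video names, and probabilities are all hypothetical values chosen for illustration.

```python
import math

# Toy joint distribution p(candidate, query).
# All names and numbers are made up for illustration; they sum to 1.
joint = {
    "common query": {"video_a": 0.60, "video_b": 0.25, "video_c": 0.10},
    "rare query": {"video_a": 0.03, "video_b": 0.01, "video_c": 0.01},
}


def marginal(joint, query):
    """p(query): sum the joint over all candidates."""
    return sum(joint[query].values())


def conditional(joint, query):
    """p(candidate | query): the joint divided by p(query)."""
    p_q = marginal(joint, query)
    return {cand: p / p_q for cand, p in joint[query].items()}


for query in joint:
    cond = conditional(joint, query)
    best = max(cond, key=cond.get)
    print(
        f"{query}: p(query)={marginal(joint, query):.2f}, "
        f"top candidate={best}, "
        f"p(top|query)={cond[best]:.2f}, "
        f"p(top,query)={joint[query][best]:.2f}"
    )
```

In this table both queries rank the same video first with a similar conditional score, because dividing by p(query) removes the information about how typical the query is. The joint value keeps that information, which is the intuition behind the abstract's remark about out-of-distribution data.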
