BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions

Pre-training a model to learn transferable video-text representations for retrieval has attracted much attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to interact videos with texts, but suffers low efficiency since every text-video pair must be fed through the model. In this work, we enable fine-grained video-text interaction while maintaining high retrieval efficiency via a novel pretext task, dubbed Multiple Choice Questions (MCQ), where a parametric module, BridgeFormer, is trained to answer the "questions" constructed from the text features by resorting to the video features. Specifically, we exploit the rich semantics of text (i.e., nouns and verbs) to build questions, with which the video encoder can be trained to capture more regional content and temporal dynamics. In the form of questions and answers, the semantic associations between local video-text features are properly established. BridgeFormer can be removed for downstream retrieval, yielding an efficient and flexible model with only two encoders. Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task on five datasets under different experimental setups (i.e., zero-shot and fine-tuning), including HowTo100M (one million videos). We further conduct zero-shot action recognition, which can be cast as video-to-text retrieval, and our approach also significantly surpasses its counterparts. As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.
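The core of the MCQ pretext task is that a "question" (a text feature with a noun or verb phrase erased) queries the video tokens through a bridge module to produce an "answer" feature, which is then contrasted against the embedding of the erased phrase. The following is a minimal, hypothetical sketch of that cross-attention step in NumPy; the function names, dimensions, and single-head attention are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bridge_answer(question, video_tokens):
    """One cross-attention step (hypothetical simplification):
    the question feature attends over video tokens and returns
    a video-conditioned 'answer' feature of the same width."""
    d = question.shape[-1]
    attn = softmax(question @ video_tokens.T / np.sqrt(d))
    return attn @ video_tokens

# Toy sizes (assumed): 8 video patch/frame tokens, 64-d features.
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 64))       # video encoder outputs
question = rng.normal(size=(1, 64))    # text feature with a noun erased
answer = bridge_answer(question, video)
# Training would push `answer` toward the text embedding of the
# erased noun/verb with a contrastive loss (not shown here).
```

Because the bridge module is only used to form this auxiliary objective, it can be discarded at inference time, leaving the plain dual-encoder for efficient retrieval.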
