Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss

Abstract

Employing the large-scale pre-trained model CLIP for the video-text retrieval (VTR) task has become a new trend that surpasses previous VTR methods. However, due to the heterogeneity of structures and contents between video and text, previous CLIP-based models are prone to overfitting during training, resulting in relatively poor retrieval performance. In this paper, we propose a multi-stream Corpus Alignment network with a single-gate Mixture-of-Experts (CAMoE) and a novel Dual Softmax Loss (DSL) to address these two kinds of heterogeneity. CAMoE employs a Mixture-of-Experts (MoE) to extract multi-perspective video representations, including action, entity, scene, etc., and then aligns them with the corresponding parts of the text. In this stage, we conduct extensive explorations of the feature extraction and feature alignment modules and arrive at an efficient VTR framework. DSL is proposed to avoid the one-way optimum match that occurs in previous contrastive methods. By introducing the intrinsic prior of each pair in a batch, DSL serves as a reviser that corrects the similarity matrix and achieves the dual optimal match. DSL takes only one line of code to implement yet yields a significant improvement. The results show that the proposed CAMoE and DSL are highly effective, and each of them individually achieves State-of-The-Art (SOTA) performance on various benchmarks such as MSR-VTT, MSVD, and LSMDC. Combining both advances the performance even further, surpassing the previous SOTA methods by around 4.6% R@1 on MSR-VTT. The code will be available soon at https://github.com/starmemda/CAMoE/

Introduction

Motivation

The primary issue currently limiting the VTR task is the heterogeneity between the two modalities, which is reflected in both structures and contents.

The heterogeneity of structures. This mainly lies in the impossibility of directly aligning the words of a sentence with the corresponding video frames (Jin et al. 2021). Single-stream or two-stream structures treat text and video as two independent parts for early or late fusion, which ignores the internal relevance between frames and words, so such models require massive data to reach decent performance. In this paper, we assume that a text can be parsed into separate parts carrying distinct aspects of information. Although directly aligning a word with a frame is unachievable, guiding the model to learn how to align cross-modal information is possible. In the example in Fig. 2, the video is paired with the sentence "a boy is performing for an audience.", where "boy", "performing", and "audience" are the keywords and can be categorized as "entity", "action", and "entity", respectively. We design several experts to learn the corresponding representations independently. In addition, a gating module is employed to measure their importance scores and then strengthen the representation of the fusion expert. This design adds few parameters and little computation while surpassing the previous State-of-The-Art (SOTA) methods on various benchmarks.
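To make the gating idea concrete, the following is a minimal PyTorch-style sketch of a single-gate expert fusion, assuming pooled video features of a fixed dimension. The class name SingleGateMoE, the linear experts, and the residual combination with the fusion expert are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class SingleGateMoE(nn.Module):
    """Illustrative single-gate Mixture-of-Experts fusion (a sketch, not the official CAMoE code).

    Each expert produces one view of the video (e.g. action / entity / scene);
    a softmax gate scores their importance and the weighted mixture strengthens
    the fusion expert's representation.
    """

    def __init__(self, dim: int, num_experts: int = 3):
        super().__init__()
        # one lightweight head per perspective (action, entity, scene, ...)
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)  # importance scores for the experts
        self.fusion = nn.Linear(dim, dim)        # fusion expert

    def forward(self, video_feat: torch.Tensor) -> torch.Tensor:
        # video_feat: (batch, dim) pooled video representation
        expert_outs = torch.stack([e(video_feat) for e in self.experts], dim=1)  # (B, E, D)
        weights = torch.softmax(self.gate(video_feat), dim=-1).unsqueeze(-1)     # (B, E, 1)
        weighted = (weights * expert_outs).sum(dim=1)                            # (B, D)
        # strengthen the fusion expert's output with the gated mixture
        return self.fusion(video_feat) + weighted


if __name__ == "__main__":
    moe = SingleGateMoE(dim=512, num_experts=3)
    fused = moe(torch.randn(4, 512))
    print(fused.shape)  # torch.Size([4, 512])
```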
Previous work has taken similar approaches, either through simple part-of-speech tagging or by exploiting multi-dimensional features of the video. HGR (Chen et al. 2020) and HCGC (Jin et al. 2021) hypothesize that a text can be organized into a hierarchical semantic graph, with sentence, action, and entity embeddings at the top-, second-, and third-level nodes, respectively. T2VLAD (Wang, Zhu, and Yang 2021) extracts features from the aspects of scene and action and performs similarity matching between the representations of each local token and the global sentence, while HiT (Liu et al. 2021) conducts cross-matching between feature-level and semantic-level embeddings. However, none of them simultaneously decomposes both the video and the text for deep alignment, which motivates our multi-stream multi-task (Ruder 2017) architecture, as shown in Fig. 2.

The heterogeneity of contents and the dual optimal-match hypothesis. Another important contribution of this paper is identifying the problem that the textual and visual modalities usually express content at different levels of specificity. The example shown in Fig. 1 compares the processes of computing the final probability matrix for video-to-text retrieval. Although each video depicts specific and explicit content, the corresponding text can be unspecific and fuzzy, which harms model training. The original method applies the softmax to each retrieval independently, ignoring the potential cross-retrieval information and leading to confusing results. To solve this, we propose the dual optimal-match hypothesis.

(Fig. 1 example captions: "A woman is decorating her finger nail", "A woman is mixing nail polish and putting an egg into it", "A girl is painting easter designs on nails".)
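To illustrate the dual-softmax idea described above, here is a minimal PyTorch sketch, assuming a batch-by-batch text-video similarity matrix with the ground-truth pairs on the diagonal. The function name dual_softmax_loss and the temperature value are assumptions for illustration, not the paper's exact implementation; the single element-wise multiplication by the reverse-direction prior corresponds to the "one line of code" mentioned in the abstract.

```python
import torch
import torch.nn.functional as F


def dual_softmax_loss(sim: torch.Tensor, temperature: float = 100.0) -> torch.Tensor:
    """Illustrative dual-softmax-style loss (a sketch, not the official CAMoE code).

    sim: (B, B) similarity matrix where rows index texts, columns index videos,
    and sim[i, i] corresponds to the ground-truth pair.
    """
    labels = torch.arange(sim.size(0), device=sim.device)

    # Text-to-video: revise each similarity with a prior computed by a softmax
    # over the text dimension (how strongly text i claims video j).
    prior_t2v = F.softmax(sim * temperature, dim=0)
    loss_t2v = F.cross_entropy(sim * prior_t2v * temperature, labels)

    # Video-to-text: symmetric revision with a prior over the video dimension.
    prior_v2t = F.softmax(sim * temperature, dim=1)
    loss_v2t = F.cross_entropy((sim * prior_v2t).t() * temperature, labels)

    return 0.5 * (loss_t2v + loss_v2t)


if __name__ == "__main__":
    sim = torch.randn(8, 8)  # toy batch of 8 text-video pairs
    print(dual_softmax_loss(sim).item())
```

In this sketch, a pair only receives a high revised score when it is the preferred match in both retrieval directions, which is the dual optimal match the prior serves to enforce.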

References

[1] Chen Sun, et al. Multi-modal Transformer for Video Retrieval, 2020, ECCV.

[2] Svetlana Lazebnik, et al. A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics, 2014.

[3] Ioannis Patras, et al. Query and Keyframe Representations for Ad-hoc Video Search, 2017, ICMR.

[4] Gang Wang, et al. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[5] Andrew Zisserman, et al. Video Google: a text retrieval approach to object matching in videos, 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[6] Georg Heigold, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.

[7] Linchao Zhu, et al. T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.

[9] Trevor Darrell, et al. Localizing Moments in Video with Natural Language, 2017 IEEE International Conference on Computer Vision (ICCV).

[10] Lei Zhang, et al. VinVL: Making Visual Representations Matter in Vision-Language Models, 2021, ArXiv.

[11] Chang Zhou, et al. CogView: Mastering Text-to-Image Generation via Transformers, 2021, NeurIPS.

[12] Gunhee Kim, et al. A Joint Sequence Fusion Model for Video Question Answering and Retrieval, 2018, ECCV.

[13] Nan Duan, et al. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, 2021, Neurocomputing.

[14] Duy-Dinh Le, et al. NII-HITACHI-UIT at TRECVID 2017, 2016, TRECVID.

[15] Yang Liu, et al. Use What You Have: Video retrieval using representations from collaborative experts, 2019, BMVC.

[16] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Andrew Zisserman, et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, 2021, ArXiv.

[18] Qiang Yang, et al. An Overview of Multi-task Learning, 2018.

[19] William B. Dolan, et al. Collecting Highly Parallel Data for Paraphrase Evaluation, 2011, ACL.

[20] Yi Yang, et al. ActBERT: Learning Global-Local Video-Text Representations, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Sebastian Ruder. An Overview of Multi-Task Learning in Deep Neural Networks, 2017, ArXiv.

[22] Enhua Wu, et al. Squeeze-and-Excitation Networks, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23] Dan Klein, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.

[24] Adwait Ratnaparkhi. A Maximum Entropy Model for Part-Of-Speech Tagging, 1996, EMNLP.

[25] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[26] Christopher Joseph Pal, et al. Movie Description, 2016, International Journal of Computer Vision.

[27] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, ArXiv.

[28] Yueting Zhuang, et al. Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval, 2021, SIGIR.

[29] Bernt Schiele, et al. The Long-Short Story of Movie Description, 2015, GCPR.

[30] Amit K. Roy-Chowdhury, et al. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval, 2018, ICMR.

[31] David J. Fleet, et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, 2017, BMVC.

[32] Bernard Ghanem, et al. ActivityNet: A large-scale video benchmark for human activity understanding, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Pengfei Xiong, et al. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP, 2021, ArXiv.

[34] Zhe Gan, et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[35] Aleksandr Petiushko, et al. MDMMT: Multidomain Multimodal Transformer for Video Retrieval, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[36] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.

[37] Shengsheng Qian, et al. HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[38] Shizhe Chen, et al. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[40] Cordelia Schmid, et al. ViViT: A Video Vision Transformer, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[41] Zhe Zhao, et al. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts, 2018, KDD.

[42] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.