Progressive Semantic Matching for Video-Text Retrieval

Cross-modal retrieval between texts and videos is important yet challenging. Previous works in this domain typically learn a common space in which to match texts and videos, but matching remains difficult because of the semantic gap between the two modalities. Although some methods employ coarse-to-fine or multi-expert networks to encode one or more common spaces for easier matching, they still optimize a single matching space directly, which is hard given the large semantic gap between modalities. To address this issue, we narrow the semantic gap through a progressive learning process with a coarse-to-fine architecture, and propose a novel Progressive Semantic Matching (PSM) method. We first construct a multilevel encoding network for videos and texts, and design auxiliary common spaces onto which the outputs of the encoders at different levels are mapped. All common spaces are then trained jointly end to end. In this way, the model effectively encodes videos and texts into a fusion common space via a progressive paradigm. Experimental results on three video-text datasets (MSR-VTT, TGIF, and MSVD) demonstrate the advantages of PSM, which achieves significant performance improvements over state-of-the-art approaches.
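To make the idea concrete, below is a minimal PyTorch sketch of the multilevel-encoder / multi-space scheme the abstract describes: two encoding levels per modality (a coarse mean-pooled level and a contextual biGRU level), each projected into its own common space, plus a fusion space, all supervised jointly by one ranking loss. All names (PSMSketch, joint_loss), the choice of exactly two levels, the dimensions, and the margin value are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PSMSketch(nn.Module):
    """Sketch: encode videos/texts at two levels, map each level and their
    fusion into separate common spaces (hypothetical dimensions)."""
    def __init__(self, video_dim=2048, text_dim=300, hidden=512, space_dim=1024):
        super().__init__()
        # Level 1 (coarse): mean-pooled global features.
        self.v_fc1 = nn.Linear(video_dim, space_dim)
        self.t_fc1 = nn.Linear(text_dim, space_dim)
        # Level 2 (finer): bidirectional GRU over the frame/word sequence.
        self.v_gru = nn.GRU(video_dim, hidden, batch_first=True, bidirectional=True)
        self.t_gru = nn.GRU(text_dim, hidden, batch_first=True, bidirectional=True)
        self.v_fc2 = nn.Linear(2 * hidden, space_dim)
        self.t_fc2 = nn.Linear(2 * hidden, space_dim)
        # Fusion space combining both levels.
        self.v_fuse = nn.Linear(2 * space_dim, space_dim)
        self.t_fuse = nn.Linear(2 * space_dim, space_dim)

    def forward(self, v_feats, t_feats):
        # v_feats: (B, Tv, video_dim) frame features; t_feats: (B, Tt, text_dim) word vectors.
        v1 = self.v_fc1(v_feats.mean(dim=1))
        t1 = self.t_fc1(t_feats.mean(dim=1))
        v2 = self.v_fc2(self.v_gru(v_feats)[0].mean(dim=1))
        t2 = self.t_fc2(self.t_gru(t_feats)[0].mean(dim=1))
        vf = self.v_fuse(torch.cat([v1, v2], dim=-1))
        tf = self.t_fuse(torch.cat([t1, t2], dim=-1))
        # One (video, text) embedding pair per common space, L2-normalized
        # so dot products below are cosine similarities.
        return [(F.normalize(v, dim=-1), F.normalize(t, dim=-1))
                for v, t in ((v1, t1), (v2, t2), (vf, tf))]

def joint_loss(spaces, margin=0.2):
    # Sum a bidirectional hinge-based triplet ranking loss over every common
    # space, so the auxiliary (coarser) spaces are optimized jointly with the
    # fusion space rather than optimizing the final matching space alone.
    total = 0.0
    for v, t in spaces:
        sim = v @ t.t()                      # (B, B) pairwise similarities
        pos = sim.diag()                     # matched pairs sit on the diagonal
        eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        cost_v2t = (margin + sim - pos.unsqueeze(1)).clamp(min=0).masked_fill(eye, 0)
        cost_t2v = (margin + sim - pos.unsqueeze(0)).clamp(min=0).masked_fill(eye, 0)
        total = total + cost_v2t.mean() + cost_t2v.mean()
    return total

Under these assumptions, training would backpropagate joint_loss(model(v_feats, t_feats)) through all three spaces, while retrieval at test time could rank candidates using the fusion-space similarities alone.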
