Tree-Augmented Cross-Modal Encoding for Complex-Query Video Retrieval

The rapid growth of user-generated videos on the Internet has intensified the need for text-based video retrieval systems. Traditional methods mainly follow the concept-based paradigm, which works for simple queries but is usually ineffective for complex queries that carry far richer semantics. Recently, the embedding-based paradigm has emerged as a popular alternative. It maps queries and videos into a shared embedding space in which semantically similar texts and videos lie close to each other. Despite its simplicity, this paradigm forgoes the syntactic structure of text queries, making it suboptimal for modeling complex queries. To facilitate video retrieval with complex queries, we propose a Tree-augmented Cross-modal Encoding method that jointly learns the linguistic structure of queries and the temporal representation of videos. Specifically, given a complex user query, we first recursively compose a latent semantic tree that structurally describes the text query. We then design a tree-augmented query encoder to derive a structure-aware query representation and a temporal attentive video encoder to model the temporal characteristics of videos. Finally, both the query and the videos are mapped into a joint embedding space for matching and ranking. This approach yields a better understanding and modeling of complex queries, and thereby better video retrieval performance. Extensive experiments on large-scale video retrieval benchmark datasets demonstrate the effectiveness of our approach.
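The retrieval pipeline the abstract describes — encode the query, encode the video with temporal weighting, then match and rank in a joint embedding space — can be sketched at a high level. The snippet below is a minimal illustration, not the paper's method: `encode_query` stands in for the tree-augmented query encoder (here simplified to mean pooling of word vectors) and `encode_video` for the temporal attentive video encoder (here an attention-weighted frame average); all function names and the toy 2-D features are hypothetical.

```python
import math

def cosine(u, v):
    # Similarity used for matching in the joint embedding space.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def encode_query(word_vecs):
    # Stand-in for the tree-augmented query encoder:
    # mean-pool word vectors into one query embedding.
    dim = len(word_vecs[0])
    return [sum(w[i] for w in word_vecs) / len(word_vecs) for i in range(dim)]

def encode_video(frame_feats, attn):
    # Stand-in for the temporal attentive video encoder:
    # attention-weighted average of per-frame features.
    dim = len(frame_feats[0])
    return [sum(a * f[i] for a, f in zip(attn, frame_feats)) for i in range(dim)]

def rank_videos(query_emb, video_embs):
    # Rank candidate videos by similarity to the query embedding.
    scores = [(vid, cosine(query_emb, emb)) for vid, emb in video_embs.items()]
    return sorted(scores, key=lambda s: -s[1])
```

In the actual model, both encoders are learned jointly so that relevant query-video pairs score higher than irrelevant ones; the sketch only fixes the interfaces between the three stages.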
