Renmin University of China and Zhejiang Gongshang University at TRECVID 2019: Learn to Search and Describe Videos

In this paper we summarize our TRECVID 2019 [7] video retrieval experiments. We participated in two tasks: Ad-hoc Video Search (AVS) and Video-to-Text (VTT). For the AVS task, we develop our solutions based on two deep learning models, i.e., the W2VV++ network [12] and the Dual Encoding network [15]. Our entry for the VTT Matching and Ranking subtask is also based on W2VV++ and Dual Encoding. For the VTT Description Generation subtask, we enhance the classical encoder-decoder model with multi-level video encoding and attribute prediction. The 2019 edition of the TRECVID benchmark was a fruitful participation for our joint team: our runs rank second for both the AVS task and the VTT Matching and Ranking subtask, and third for the VTT Description Generation subtask in terms of the CIDEr-D criterion.
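To make the common-space retrieval idea concrete, the sketch below illustrates the general technique shared by W2VV++ [12] and Dual Encoding [15]: videos and sentences are projected into a shared embedding space, and candidates are ranked by cosine similarity. This is a minimal PyTorch sketch rather than the code of our actual runs; the mean-pooling encoders, the feature dimensions, and the names VideoEncoder, TextEncoder, and rank_videos are illustrative assumptions (our submitted models use richer multi-level encoders and are trained with an improved triplet ranking loss in the spirit of VSE++ [5]).

import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    # Hypothetical encoder: mean-pool pre-extracted frame features (e.g. CNN
    # features), then project into the shared space and L2-normalize.
    def __init__(self, frame_dim=2048, embed_dim=512):
        super().__init__()
        self.fc = nn.Linear(frame_dim, embed_dim)

    def forward(self, frames):                  # frames: (batch, n_frames, frame_dim)
        pooled = frames.mean(dim=1)             # temporal mean pooling
        return F.normalize(self.fc(pooled), dim=-1)

class TextEncoder(nn.Module):
    # Hypothetical encoder: mean-pool word embeddings of the query, then
    # project into the same shared space and L2-normalize.
    def __init__(self, vocab_size=10000, word_dim=300, embed_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.fc = nn.Linear(word_dim, embed_dim)

    def forward(self, token_ids):               # token_ids: (batch, n_tokens)
        pooled = self.embed(token_ids).mean(dim=1)
        return F.normalize(self.fc(pooled), dim=-1)

def rank_videos(query_vec, video_vecs):
    # Both sides are L2-normalized, so the dot product equals cosine similarity.
    sims = video_vecs @ query_vec.squeeze(0)    # (n_videos,)
    return sims.argsort(descending=True)        # best-matching videos first

venc, tenc = VideoEncoder(), TextEncoder()
video_vecs = venc(torch.randn(100, 30, 2048))       # 100 candidate videos
query_vec = tenc(torch.randint(0, 10000, (1, 6)))   # one 6-token query
print(rank_videos(query_vec, video_vecs)[:10])      # top-10 video indices

Under this view, AVS and VTT Matching and Ranking reduce to the same ranking problem with the roles of the two encoders swapped: AVS ranks videos for a textual query, while VTT matching ranks candidate sentences per video.

For the Description Generation subtask, the following fragment (continuing the imports above) sketches one way attribute prediction can enhance a classical encoder-decoder captioner: a multi-label attribute head produces attribute probabilities that are concatenated with the pooled video feature to initialize a GRU decoder. The layer sizes and the name AttributeCaptioner are again hypothetical, not our submitted architecture.

class AttributeCaptioner(nn.Module):
    # Hypothetical captioner: a multi-label attribute head on the pooled video
    # feature; its sigmoid probabilities are concatenated with that feature
    # to initialize the GRU decoder's hidden state.
    def __init__(self, frame_dim=2048, n_attrs=512, hidden=512, vocab_size=10000):
        super().__init__()
        self.attr_head = nn.Linear(frame_dim, n_attrs)
        self.init_h = nn.Linear(frame_dim + n_attrs, hidden)
        self.word_embed = nn.Embedding(vocab_size, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frames, captions):        # captions: (batch, seq) token ids
        video = frames.mean(dim=1)              # pooled video feature
        attrs = torch.sigmoid(self.attr_head(video))    # attribute probabilities
        h0 = torch.tanh(self.init_h(torch.cat([video, attrs], dim=-1)))
        states, _ = self.gru(self.word_embed(captions), h0.unsqueeze(0))
        return self.out(states)                 # per-step vocabulary logits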

[1] Dumitru Erhan et al. Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016.

[2] Xirong Li et al. Predicting Visual Features From Text for Image and Video Caption Retrieval. IEEE Transactions on Multimedia, 2017.

[3] Xirong Li et al. University of Amsterdam and Renmin University at TRECVID 2016: Searching Video, Detecting Events and Describing Video. TRECVID, 2016.

[4] Duy-Dinh Le et al. NII-HITACHI-UIT at TRECVID 2017. TRECVID, 2017.

[5] David J. Fleet et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. BMVC, 2018.

[6] Xirong Li et al. Renmin University of China and Zhejiang Gongshang University at TRECVID 2018: Deep Cross-Modal Embeddings for Video-Text Retrieval. TRECVID, 2018.

[7] Jonathan G. Fiscus et al. TRECVID 2019: An evaluation campaign to benchmark Video Activity Detection, Video Captioning and Matching, and Video Search & retrieval. TRECVID, 2019.

[8] Xirong Li et al. Early Embedding and Late Reranking for Video Captioning. ACM Multimedia, 2016.

[9] Tao Mei et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016.

[10] William B. Dolan et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011.

[11] Jianfeng Dong et al. DL-61-86 at TRECVID 2017: Video-to-Text Description. TRECVID, 2017.

[12] Xirong Li et al. W2VV++: Fully Deep Learning for Ad-hoc Video Search. ACM Multimedia, 2019.

[13] Yale Song et al. TGIF: A New Dataset and Benchmark on Animated GIF Description. CVPR, 2016.

[14] Ming-Wei Chang et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[15] Xirong Li et al. Dual Encoding for Zero-Example Video Retrieval. CVPR, 2019.