DL-61-86 at TRECVID 2017: Video-to-Text Description

In this paper, we summarize our work on the video-to-text description (VTT) task of TRECVID 2017. This year we participated in the matching and ranking subtask of VTT. Our entry is based on Word2VisualVec [13] and a newly devised Spatial Enhanced Representation (SER). Word2VisualVec is a deep neural network architecture that learns to predict a deep visual encoding of a textual input; it was the winning entry in the VTT task of TRECVID 2016. We improve Word2VisualVec by replacing the average pooling of the textual input with multi-scale sentence vectorization [6] and by using an improved triplet ranking loss [7]. The SER consists of two neural network branches that project videos and sentences, respectively, into a learned common latent space. For the video-side branch, the model extracts an enhanced spatio-temporal representation of the input video. We implement this by learning a GRU with skip connections that allow the spatial feature to bypass the recurrent units. Our best run is an ensemble of six models, all variants of Word2VisualVec and SER. It leads the evaluation by a clear margin among all submissions from the ten participating teams worldwide.
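To make the improved triplet ranking loss concrete, the sketch below follows the hard-negative formulation of VSE++ cited above: for each matching video-sentence pair in a mini-batch, only the hardest (most similar) negative contributes to the hinge loss. This is a minimal illustration under assumed conventions (PyTorch, cosine similarity on L2-normalized embeddings, the function name and margin value are hypothetical), not the authors' released code.

```python
# Sketch of the hard-negative triplet ranking loss (VSE++-style), assumed PyTorch.
import torch
import torch.nn.functional as F


def hard_negative_triplet_loss(video_emb, sent_emb, margin=0.2):
    """video_emb, sent_emb: (batch, dim) L2-normalized embeddings;
    row i of video_emb matches row i of sent_emb."""
    # Cosine similarity matrix; the diagonal holds the positive pairs.
    sims = video_emb @ sent_emb.t()
    pos = sims.diag().view(-1, 1)

    # Hinge cost of every negative against the positive in the same row/column.
    cost_s = (margin + sims - pos).clamp(min=0)      # sentence retrieval per video
    cost_v = (margin + sims - pos.t()).clamp(min=0)  # video retrieval per sentence

    # Remove the positives on the diagonal so they never count as negatives.
    mask = torch.eye(sims.size(0), dtype=torch.bool, device=sims.device)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_v = cost_v.masked_fill(mask, 0)

    # Keep only the hardest negative per row/column ("max of hinges").
    return cost_s.max(dim=1)[0].mean() + cost_v.max(dim=0)[0].mean()


if __name__ == "__main__":
    v = F.normalize(torch.randn(8, 512), dim=1)
    s = F.normalize(torch.randn(8, 512), dim=1)
    print(hard_negative_triplet_loss(v, s).item())
```

Compared with summing the hinge loss over all negatives, optimizing only the hardest negative tends to yield a sharper embedding space for retrieval, which is the motivation for adopting it here.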

[1] Jonathan G. Fiscus et al. TRECVID 2016: Evaluating Video Search, Video Event Detection, Localization, and Hyperlinking. TRECVID, 2016.

[2] Subhashini Venugopalan et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks. NAACL, 2014.

[3] Xirong Li et al. University of Amsterdam and Renmin University at TRECVID 2016: Searching Video, Detecting Events and Describing Video. TRECVID, 2016.

[4] David J. Fleet et al. VSE++: Improved Visual-Semantic Embeddings. arXiv, 2017.

[5] Tao Mei et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016.

[6] Jeffrey Dean et al. Efficient Estimation of Word Representations in Vector Space. ICLR, 2013.

[7] Yoshua Bengio et al. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. EMNLP, 2014.

[8] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[9] Xirong Li et al. Predicting Visual Features From Text for Image and Video Caption Retrieval. IEEE Transactions on Multimedia, 2017.

[10] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015.

[11] William B. Dolan et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011.

[12] Xirong Li et al. Early Embedding and Late Reranking for Video Captioning. ACM Multimedia, 2016.

[13] Yale Song et al. TGIF: A New Dataset and Benchmark on Animated GIF Description. CVPR, 2016.

[14] Georges Quénot et al. TRECVID 2017: Evaluating Ad-hoc and Instance Video Search, Events Detection, Video Captioning and Hyperlinking. TRECVID, 2017.

[15] Dennis Koelma et al. The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection. ICMR, 2016.