NTU ROSE Lab at TRECVID 2018: Ad-hoc Video Search and Video to Text

This paper describes our participation in the ad-hoc video search and video-to-text tasks of TRECVID 2018. For ad-hoc video search, we adapted an image-based visual semantic embedding approach and trained our model on the combined MS COCO and Flickr30k datasets. We extracted multiple keyframes from each shot and performed similarity search using the computed embeddings. For the video-to-text description generation subtask, we trained a video captioning model with multiple features, using a reinforcement learning method, on the combination of the MSR-VTT and MSVD video captioning datasets. For the matching and ranking subtask, we trained two types of image-based ranking models on the MS COCO dataset.

1 Ad-hoc Video Search (AVS)

In the ad-hoc video search task, we are given 30 free-text queries and are required to return the top 1000 shots from the test-set videos [1, 2]. The queries are listed in Appendix A. The test set contains 4593 Internet Archive videos, totaling 600 hours and about 450K shots (publicly available on the TRECVID website). The videos range in duration from 6.5 to 9.5 minutes, and the reference shot boundaries are publicly available. No annotated training data was provided specifically for the AVS task. We participated with a “fully automatic” (Type F) system trained on already available annotated datasets (Type D), namely MS COCO and Flickr30k [5, 6].

1.1 Visual Semantic Embedding

We adapted visual semantic embeddings, VSE/VSE++ [3, 4], for cross-modal retrieval. Given a set of image-caption pairs, VSE++ learns a joint embedding space that supports cross-modal retrieval, i.e., given a text query, retrieve images/videos, or vice versa.
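For illustration, the hard-negative ranking objective of VSE++ that we build on can be sketched as follows in PyTorch. This is a minimal sketch: the margin value and the use of in-batch hard negatives follow the published VSE++ formulation, and it does not imply the exact hyperparameters or encoder architectures of our runs.

import torch

def vsepp_loss(img_emb, cap_emb, margin=0.2):
    """Max-of-hinges ranking loss with in-batch hard negatives (VSE++-style).

    img_emb, cap_emb: (batch, dim) L2-normalized embeddings of matching
    image-caption pairs; row i of img_emb matches row i of cap_emb.
    The margin value here is illustrative.
    """
    scores = img_emb @ cap_emb.t()               # cosine similarities, (batch, batch)
    diagonal = scores.diag().view(-1, 1)         # similarities of the positive pairs

    # hinge violations against all negatives in the batch
    cost_cap = (margin + scores - diagonal).clamp(min=0)      # image paired with wrong caption
    cost_img = (margin + scores - diagonal.t()).clamp(min=0)  # caption paired with wrong image

    # ignore the positive pairs on the diagonal
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_cap = cost_cap.masked_fill(mask, 0)
    cost_img = cost_img.masked_fill(mask, 0)

    # keep only the hardest negative per image and per caption
    return cost_cap.max(dim=1)[0].mean() + cost_img.max(dim=0)[0].mean()

Penalizing only the hardest in-batch violation in each direction, rather than summing over all negatives, is the modification that distinguishes VSE++ from the original VSE objective.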

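Given the learned joint space, answering a query reduces to a nearest-neighbour search over keyframe embeddings followed by aggregation to the shot level. The sketch below is a minimal illustration, assuming keyframe and query embeddings have already been computed and L2-normalized; the max-pooling over keyframes and the function and variable names are illustrative choices, not necessarily the exact aggregation used in our submission.

import numpy as np

def rank_shots(query_emb, keyframe_embs, shot_ids, top_k=1000):
    """Rank shots for one text query.

    query_emb:     (dim,) L2-normalized embedding of the query text.
    keyframe_embs: (num_keyframes, dim) L2-normalized keyframe embeddings.
    shot_ids:      list of length num_keyframes mapping each keyframe to its shot.
    Returns the top_k shot ids, a shot being scored by its best-matching keyframe.
    """
    sims = keyframe_embs @ query_emb  # cosine similarity of each keyframe to the query

    # aggregate keyframe scores to shot level by max-pooling
    shot_scores = {}
    for sid, sim in zip(shot_ids, sims):
        if sid not in shot_scores or sim > shot_scores[sid]:
            shot_scores[sid] = sim

    ranked = sorted(shot_scores, key=shot_scores.get, reverse=True)
    return ranked[:top_k]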
[1] Pietro Perona et al. Microsoft COCO: Common Objects in Context. ECCV, 2014.

[2] Shin'ichi Satoh et al. Consensus-based Sequence Training for Video Captioning. arXiv, 2017.

[3] Tao Mei et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. CVPR, 2016.

[4] Lorenzo Torresani et al. Learning Spatiotemporal Features with 3D Convolutional Networks. ICCV, 2015.

[5] Gang Wang et al. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models. CVPR, 2018.

[6] Sanja Fidler et al. Order-Embeddings of Images and Language. ICLR, 2015.

[7] David J. Fleet et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. BMVC, 2017.

[8] George Awad et al. On Influential Trends in Interactive Video Retrieval: Video Browser Showdown 2015–2017. IEEE Transactions on Multimedia, 2018.

[9] Jonathan G. Fiscus et al. TRECVID 2018: Benchmarking Video Activity Detection, Video Captioning and Matching, Video Storytelling Linking and Video Search. TRECVID, 2018.

[10] Walter Daelemans et al. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). EMNLP, 2014.

[11] Aren Jansen et al. CNN architectures for large-scale audio classification. ICASSP, 2017.

[12] Xi Chen et al. Stacked Cross Attention for Image-Text Matching. ECCV, 2018.

[13] Peter Young et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.

[14] Hans-Peter Kriegel et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD, 1996.

[15] Jeffrey Pennington et al. GloVe: Global Vectors for Word Representation. EMNLP, 2014.

[16] Gaël Varoquaux et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 2011.

[17] William B. Dolan et al. Collecting Highly Parallel Data for Paraphrase Evaluation. ACL, 2011.

[18] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.