VIREO @ TRECVID 2017: Video-to-Text, Ad-hoc Video Search, and Video Hyperlinking

In this paper, we describe the systems developed for the Video-to-Text (VTT), Ad-hoc Video Search (AVS) and Video Hyperlinking (LNK) tasks at TRECVID 2017 [1] and the achieved results.

Video-to-Text Description (VTT): We participate in the TRECVID 2017 pilot task of Video-to-Text Description, which consists of two subtasks, i.e., Matching and Ranking, and Description Generation.

Matching and Ranking task: To compare the effectiveness of spatial and temporal attention, we experiment with the following models.

No attention model: Each video is represented by average pooling over both the spatial and temporal dimensions of the ResNet-152 features extracted from frames, and the text description is encoded by an LSTM. We then learn an embedding space that minimizes the distance between a video and its corresponding text description using a triplet loss. Furthermore, C3D is utilized to extract motion features of videos. The similarity scores from the two kinds of features are averagely fused for the final ranking.

Spatial attention model: Average pooling is applied only over the temporal dimension, and the spatial dimension is kept. We then train an attention model on the spatial feature map of the video to compute the similarity score, which is used for the final ranking.

Temporal attention model: Different from the spatial attention model, we train the attention model at the frame level and perform average pooling over the spatial dimension.

No-spatial-temporal attention model: The similarity scores from the above three models are averagely fused for the final ranking.

Description Generation task: We adopt a similar approach to the matching and ranking task; the difference is that an LSTM is used to generate the sentence word by word. More details about the model can be found in [2, 3]. Our submissions can be summarized as follows.

No attention model: Each video is represented by average pooling over both the spatial and temporal dimensions of the ResNet-152 features extracted from frames, and an LSTM is used to generate the sentence word by word. Furthermore, we concatenate the ResNet-152 and C3D features and feed them into the LSTM to generate descriptions for videos.

Spatial attention model: For video features, average pooling is applied only over the temporal dimension, and the spatial dimension is kept. We then train an attention model to attend to different features when generating different words of the sentence.

Temporal attention model: For video features, this model performs average pooling over the spatial dimension and learns an attention model over the temporal dimension.

Ad-hoc Video Search (AVS): We merged three search systems for AVS: our concept-based, zero-example video search system, which has proven useful in previous years [4]; a video captioning system trained separately for the VTT task; and a text-based search system that computes similarities between the query and videos using metadata extracted from the videos. In this study, we intend to find out whether combining the concept-based system, the captioning system and the text-based search system helps to improve search performance. We submit 5 fully automatic runs and 3 manually-assisted runs. Our runs are listed as follows.

F_D_VIREO.17_1: An automatic run (infAP=0.093) that uses the concept-based video search system only. The concept bank contains about 15K concepts, collected from ImageNet Shuffle [5], FCVID [6], Sports-1M [7], SIN [8], Places and the Research Set [9, 10]. Most of the concept detectors are trained or fine-tuned with ResNet-50 [11].

F_D_VIREO.17_2: An automatic run that combines the results of F_D_VIREO.17_1 and the video captioning system. In the captioning system, both ResNet and C3D features are used. The weight ratio between the ResNet, C3D and concept results in the fusion is 2:1:3 (the weighted late-fusion scheme is sketched after this run list). The performance is infAP=0.120.

F_D_VIREO.17_3: The results of the concept-based system and the video captioning system are combined with the same approach as in F_D_VIREO.17_2, but the fusion is the average of the ResNet, C3D and concept results. This run gets infAP=0.116.

F_D_VIREO.17_4: The results of F_D_VIREO.17_3 and the text-based search system are combined in this run. The metadata, the on-screen text and the speech are extracted from the videos and fed into Lucene to build the text-based search system. The weight ratio between F_D_VIREO.17_3 and the text-based search results in the fusion is 10:1. The performance is infAP=0.116.

M_D_VIREO.17_1: This manually-assisted run is based on F_D_VIREO.17_1, using the concept-based video search system. Human effort is involved in two steps: (1) a user corrects mistakes in the queries after automatic NLP parsing; (2) the automatically proposed semantic concepts are screened by the user, who deletes unrelated or non-discriminative concepts. The run gets infAP=0.124.

M_D_VIREO.17_2: In this run, the result of M_D_VIREO.17_1 is fused with the result of the captioning system used in F_D_VIREO.17_2. The weight ratio between the ResNet, C3D and concept results in the fusion is manually selected based on the user's experience. The run ends up with infAP=0.164.

M_D_VIREO.17_3: This run is an automatic run that combines the result of F_D_VIREO.17_2 with the result of the text-based search system in F_D_VIREO.17_4. The weight ratio between F_D_VIREO.17_2 and the text-based search results in the fusion is 10:1. This run achieves the best performance among our automatic runs, with infAP=0.120.

M_D_VIREO.17_4: This run gets the best performance among our submitted manually-assisted runs, with infAP=0.164. In this run, the results of M_D_VIREO.17_2 and the text-based search system are combined with a weight ratio of 10:x in the fusion, where x is defined by analyzing the popularity of the query terms using Google Books Ngram.
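To make the weighted late fusion used in the AVS runs concrete, the sketch below shows how per-system relevance scores could be combined with the 2:1:3 and 10:1 ratios described above. It is a minimal illustration only: the score arrays, the assumption that scores are min-max normalized to [0, 1], and the function and variable names are our own placeholders rather than the actual system code.

```python
import numpy as np

def weighted_fusion(score_lists, weights):
    """Weighted average of per-video score vectors, all aligned to the
    same candidate-video ordering; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    scores = np.stack([np.asarray(s, dtype=np.float64) for s in score_lists])
    return (w[:, None] * scores).sum(axis=0)

# Hypothetical per-system scores for the same 1000 candidate videos,
# assumed to be min-max normalized to [0, 1] beforehand.
rng = np.random.default_rng(0)
resnet_caption = rng.random(1000)   # captioning system, ResNet branch
c3d_caption    = rng.random(1000)   # captioning system, C3D branch
concept_based  = rng.random(1000)   # concept-based search system
text_based     = rng.random(1000)   # Lucene text-based search system

# F_D_VIREO.17_2-style fusion: ResNet : C3D : concept = 2 : 1 : 3
f2 = weighted_fusion([resnet_caption, c3d_caption, concept_based], [2, 1, 3])

# F_D_VIREO.17_3-style fusion: equal-weight average of the three results
f3 = weighted_fusion([resnet_caption, c3d_caption, concept_based], [1, 1, 1])

# F_D_VIREO.17_4-style fusion: previous result : text-based = 10 : 1
f4 = weighted_fusion([f3, text_based], [10, 1])

# Final ranked list of candidate-video indices (best first)
ranking = np.argsort(-f4)
```

In practice the four score arrays would come from the concept-based search, the two captioning branches and the Lucene text index, aligned over the same candidate videos for each query.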
Video Hyperlinking (LNK): We introduce two novelties: (a) the development of a new semantic representation network (SRN) for evaluating cross-modal similarities; (b) re-ranking of the search result by considering data risk based on the statistical properties of hubness, local intrinsic dimensionality (LID) and diversity [12].

Run-1: Visual baseline. The visual run relies on large concept banks including more than 14K concept classifiers [13]. The relatedness between anchors and targets is evaluated based on the average fusion of SRN and cosine similarity.

Run-2: Rerun of Run-1 using the LID-first algorithm proposed in [12]. The goal is to promote the ranks of targets with "lower data risk", specifically targets lying in a lower local intrinsic dimension, being hubs of the data, and being sufficiently diverse from their neighboring region.

Run-3: Multimodal baseline. This run combines the visual Run-1 and the text features extracted from ASR. Using SRN, we evaluate four different kinds of similarities between anchors and targets: visual-visual, visual-text, text-visual and text-text. These similarities are averagely fused to quantify the relatedness between anchors and targets. Finally, we further fuse the three kinds of relatedness, namely SRN, visual cosine similarity and textual cosine similarity, with weights of 0.5, 0.3 and 0.2, respectively (a sketch of this fusion follows the run list).

Run-4: Rerun of Run-3 using the LID-first algorithm proposed in [12].
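As a concrete illustration of the Run-3 fusion, the following sketch averages the four cross-modal SRN similarities and then combines them with the visual and textual cosine similarities using the 0.5/0.3/0.2 weights. The feature layout and the `srn_score` callable are assumptions made for illustration; they stand in for the learned SRN rather than reproducing it.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def relatedness(anchor, target, srn_score):
    """Fuse SRN-based and cosine relatedness for one anchor/target pair.

    `anchor` and `target` are dicts holding 'visual' and 'text' feature
    vectors; `srn_score(x, y)` is a placeholder callable assumed to return
    the SRN similarity between any two modality embeddings.
    """
    # Average of the four cross-modal SRN similarities
    srn = np.mean([
        srn_score(anchor['visual'], target['visual']),   # visual-visual
        srn_score(anchor['visual'], target['text']),     # visual-text
        srn_score(anchor['text'],   target['visual']),   # text-visual
        srn_score(anchor['text'],   target['text']),     # text-text
    ])
    visual_cos = cosine(anchor['visual'], target['visual'])
    text_cos   = cosine(anchor['text'],   target['text'])
    # Run-3 weighting: SRN 0.5, visual cosine 0.3, textual cosine 0.2
    return 0.5 * srn + 0.3 * visual_cos + 0.2 * text_cos
```

Targets would then be ranked per anchor by this fused relatedness score, optionally followed by the LID-first re-ranking of Run-2 and Run-4.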
1 Video-to-Text Description (VTT)

The TRECVID 2017 pilot task of Video-to-Text Description is challenging since it involves a detailed understanding of the video content, including many concepts such as objects, actions, scenes, person-object relations, the temporal order of events and so on. Moreover, the inter-modality correspondence between video content and natural-language sentences is also nontrivial for this task. In this task, a set of 1,880 Vine videos is randomly selected from more than 50,000 Twitter Vine videos. Each video has a duration of around 6 seconds and is annotated multiple times by different annotators. When describing each video, annotators are asked to write a sentence that includes, if appropriate and applicable, four facets of the video, i.e., who, what, where and when.

1.1 Matching and Ranking

1.1.1 Task Description

In this subtask, participants are asked to rank a set of text descriptions in terms of relevance to a given video. Different from last year, the matching and ranking subtask this year splits the whole video set into four testing subsets (2, 3, 4, 5) of varying sizes, in order to measure the impact of the set size on performance. Concretely, subset 2 includes 1,613 videos, subset 3 includes 795 videos, subset 4 includes 388 videos, and subset 5 includes 159 videos. For subset 2, participants are asked to rank two independent description sets: A and B. Subset 3 has three description sets: A, B, and C. Subset 4 has four description sets: A, B, C, and D. Subset 5 has five description sets: A, B, C, D, and E.
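To make the matching and ranking setup concrete, below is a minimal PyTorch sketch of the no-attention model summarized in the overview: frame-level ResNet-152 features are mean-pooled, the description is encoded with an LSTM, both are projected into a joint embedding trained with a triplet ranking loss, and candidate descriptions are ranked by cosine similarity. The dimensions, class names and in-batch negative sampling are illustrative assumptions, not the exact configuration of our submitted runs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextEmbedding(nn.Module):
    """No-attention matching model: mean-pooled video features and an
    LSTM-encoded sentence are projected into a shared embedding space."""

    def __init__(self, video_dim=2048, vocab_size=10000, word_dim=300,
                 hidden_dim=512, embed_dim=512):
        super().__init__()
        self.video_fc = nn.Linear(video_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.lstm = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        self.text_fc = nn.Linear(hidden_dim, embed_dim)

    def encode_video(self, frame_feats):      # (B, T, video_dim), spatially pooled beforehand
        pooled = frame_feats.mean(dim=1)      # average pooling over the temporal dimension
        return F.normalize(self.video_fc(pooled), dim=-1)

    def encode_text(self, tokens):            # (B, L) word indices
        _, (h, _) = self.lstm(self.word_embed(tokens))
        return F.normalize(self.text_fc(h[-1]), dim=-1)

def triplet_ranking_loss(v, t, margin=0.2):
    """Hinge-based triplet loss using all other in-batch pairs as negatives."""
    sim = v @ t.t()                                   # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # positive pair score per video
    cost_t = (margin + sim - pos).clamp(min=0)        # negative descriptions per video
    cost_v = (margin + sim - pos.t()).clamp(min=0)    # negative videos per description
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = cost_t.masked_fill(mask, 0)
    cost_v = cost_v.masked_fill(mask, 0)
    return cost_t.mean() + cost_v.mean()

# Ranking at test time: sort the candidate descriptions of a subset by cosine
# similarity to each video; scores from a ResNet-152 model and a C3D model
# can be averaged before sorting, e.g.:
#   scores = 0.5 * (v_resnet @ t.t()) + 0.5 * (v_c3d @ t.t())
#   ranked = scores.argsort(dim=1, descending=True)
```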

References

[1] Jonas Mueller et al. Siamese Recurrent Architectures for Learning Sentence Similarity, 2016, AAAI.
[2] Lorenzo Torresani et al. Learning Spatiotemporal Features with 3D Convolutional Networks, 2015 IEEE International Conference on Computer Vision (ICCV).
[3] Ting Yao et al. VIREO @ TRECVID 2014: Instance Search and Semantic Indexing, 2014, TRECVID.
[4] Georges Quénot et al. TRECVID 2017: Evaluating Ad-hoc and Instance Video Search, Events Detection, Video Captioning and Hyperlinking, 2017, TRECVID.
[5] Chong-Wah Ngo et al. Concept-Based Interactive Search System, 2017, MMM.
[6] Dennis Koelma et al. The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection, 2016, ICMR.
[7] Paul Over et al. Creating HAVIC: Heterogeneous Audio Visual Internet Collection, 2012, LREC.
[8] Trevor Darrell et al. Sequence to Sequence -- Video to Text, 2015 IEEE International Conference on Computer Vision (ICCV).
[9] Lori Lamel. Multilingual Speech Processing Activities in Quaero: Application to Multimedia Search in Unstructured Data, 2012, Baltic HLT.
[10] Nitish Srivastava et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.
[11] Bolei Zhou et al. Learning Deep Features for Scene Recognition using Places Database, 2014, NIPS.
[12] Christopher Joseph Pal et al. Describing Videos by Exploiting Temporal Structure, 2015 IEEE International Conference on Computer Vision (ICCV).
[13] Shih-Fu Chang et al. Exploiting Feature and Class Relationships in Video Categorization with Regularized Deep Neural Networks, 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[14] Chong-Wah Ngo et al. On the Selection of Anchors and Targets for Video Hyperlinking, 2017, ICMR.
[15] Siu Cheung Hui et al. Learning to Rank Question Answer Pairs with Holographic Dual LSTM Architecture, 2017, SIGIR.
[16] Fei-Fei Li et al. Large-Scale Video Classification with Convolutional Neural Networks, 2014 IEEE Conference on Computer Vision and Pattern Recognition.
[17] Luca Rossetto et al. Interactive video search tools: a detailed analysis of the video browser showdown 2015, 2016, Multimedia Tools and Applications.
[18] Jian Sun et al. Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Chong-Wah Ngo et al. Enhanced VIREO KIS at VBS 2018, 2018, MMM.
[20] Tao Mei et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Chong-Wah Ngo et al. Event Detection with Zero Example: Select the Right and Suppress the Wrong Concepts, 2016, ICMR.
[22] Jürgen Schmidhuber et al. Long Short-Term Memory, 1997, Neural Computation.
[23] Pascale Sébillot et al. Exploiting Multimodality in Video Hyperlinking to Improve Target Diversity, 2017, MMM.
[24] Georgios Balikas et al. Topical Coherence in LDA-based Models through Induced Segmentation, 2017, ACL.