IRISA at TrecVid 2017: Beyond Crossmodal and Multimodal Models for Video Hyperlinking

This paper presents the runs submitted to the TRECVid 2017 Challenge for the Video Hyperlinking task. The goal of the task is to propose a list of video segments, called targets, that complement query video segments, called anchors. The data provided with the task encourage participants to use multiple modalities, such as the audio track and the keyframes. In this context, we submitted four runs: 1) BiDNNFull uses a BiDNN model to combine ResNet with Word2Vec; 2) BiDNNFilter uses the same model and also exploits the metadata to narrow down the list of possible candidates; 3) BiDNNPinv tries to improve the anchor keyframe fusion by using the Moore-Penrose pseudo-inverse; and 4) noBiDNNPinv tests the relevance of not using a BiDNN to fuse the modalities. Our runs were built on a pre-trained ResNet model together with the transcripts and the metadata provided by the task organizers. The results show a gain in performance over the baseline BiDNN model both when the metadata filter was used and when the keyframe fusion was done with a pseudo-inverse.
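The abstract does not spell out how the pseudo-inverse is applied to the anchor keyframes, so the following is only a minimal sketch of one plausible reading, modeled on memory-vector style aggregation: the keyframe embeddings of an anchor are fused into a single vector whose dot product with each keyframe is as close to 1 as possible. The function names, the random stand-in embeddings, and the dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np


def average_fusion(keyframes: np.ndarray) -> np.ndarray:
    """Baseline fusion: mean of the keyframe embeddings (one embedding per row)."""
    return keyframes.mean(axis=0)


def pinv_fusion(keyframes: np.ndarray) -> np.ndarray:
    """Pseudo-inverse fusion (assumed formulation, not confirmed by the paper).

    Given an (n, d) matrix K with one keyframe embedding per row, return the
    minimum-norm least-squares solution m of K @ m = 1, i.e. m = pinv(K) @ 1.
    """
    ones = np.ones(keyframes.shape[0])
    return np.linalg.pinv(keyframes) @ ones


# Hypothetical stand-in for the embeddings of 5 keyframes (dimension 16 here;
# real ResNet features would be much larger).
rng = np.random.default_rng(0)
keyframes = rng.normal(size=(5, 16))
keyframes /= np.linalg.norm(keyframes, axis=1, keepdims=True)  # L2-normalize rows

fused_mean = average_fusion(keyframes)
fused_pinv = pinv_fusion(keyframes)

# With fewer keyframes than dimensions and linearly independent rows,
# every keyframe has the same similarity to the pinv-fused vector.
similarities = keyframes @ fused_pinv
```

Under this assumption, the pseudo-inverse fusion weights each keyframe so that none dominates the anchor representation, whereas plain averaging favors groups of near-duplicate keyframes.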
