A Study on Multimodal Video Hyperlinking with Visual Aggregation

Video hyperlinking offers a way to explore a video collection through links that connect segments with related content. Hyperlinking systems therefore seek to automatically create such links by connecting given anchor segments to relevant target segments within the collection. In this paper, we further investigate multimodal representations of video segments in a hyperlinking system based on bidirectional deep neural networks, which achieved state-of-the-art results in the TRECVid 2016 evaluation. We conduct a systematic study of different input representations, with a focus on the aggregation of the representations of multiple keyframes. In particular, this includes the use of memory vectors as a novel aggregation technique, which yields a significant improvement over other aggregation methods on the final hyperlinking task. Additionally, we investigate the use of metadata, which leads to increased performance and lower computational requirements for the system.
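To illustrate the kind of keyframe aggregation discussed here, the sketch below shows the simple sum construction of a memory vector: the L2-normalized descriptors of a segment's keyframes are summed into a single vector that can then be compared to a query by inner product. This is a minimal illustration, not the paper's exact pipeline; the descriptor dimensionality and the choice of the sum (rather than pseudo-inverse) construction are assumptions for the example.

```python
import numpy as np

def memory_vector(descriptors):
    """Aggregate keyframe descriptors into one memory vector
    (simple sum construction): L2-normalize each descriptor,
    sum them, and normalize the result for cosine comparison."""
    X = np.asarray(descriptors, dtype=np.float64)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm rows
    m = X.sum(axis=0)
    return m / np.linalg.norm(m)

def score(query, m):
    """Cosine similarity between a query descriptor and a memory vector."""
    q = np.asarray(query, dtype=np.float64)
    return float(np.dot(q / np.linalg.norm(q), m))

# Toy example: a segment represented by two orthogonal keyframe descriptors.
m = memory_vector([[1.0, 0.0], [0.0, 1.0]])
```

A descriptor that appears among the segment's keyframes scores markedly higher against the memory vector than an unrelated one, which is what makes a single aggregated vector usable in place of per-keyframe matching.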
