A Crossmodal Approach to Multimodal Fusion in Video Hyperlinking

With the recent resurgence of neural networks and the proliferation of massive amounts of unlabeled multimodal data, recommendation systems and multimodal retrieval systems based on continuous representation spaces and deep learning methods are attracting great interest. Multimodal representations are typically obtained with autoencoders that reconstruct multimodal data. In this article, we describe an alternative method for high-level multimodal fusion that leverages crossmodal translation by means of symmetrical encoders cast into a bidirectional deep neural network (BiDNN). Using the lessons learned from multimodal retrieval, we present a BiDNN-based system that performs video hyperlinking and recommends interesting video segments to a viewer. Results obtained on TRECVID's 2016 video hyperlinking benchmarking initiative show that our method achieved the best score, thus defining the state of the art.
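A minimal sketch of a BiDNN-style fusion model follows, assuming PyTorch; the layer sizes, Tanh activations, and the exact weight-tying scheme (a single central matrix W, reused transposed in the opposite direction), as well as the 100-d text and 4096-d visual feature dimensions, are illustrative assumptions rather than the configuration reported in the article. The idea captured here is the one the abstract names: each modality is translated into the other through a shared central layer, and the joint multimodal representation is the concatenation of the two central activations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiDNN(nn.Module):
    """Two symmetric crossmodal translation paths (e.g. text -> visual and
    visual -> text). The central weight matrix is shared between the paths
    (used transposed in the opposite direction), and the joint multimodal
    embedding is the concatenation of the two central activations."""

    def __init__(self, dim_a, dim_b, hidden_dim=1024, joint_dim=1024):
        super().__init__()
        # Modality-specific input layers of each translation path.
        self.enc_a = nn.Linear(dim_a, hidden_dim)
        self.enc_b = nn.Linear(dim_b, hidden_dim)
        # Tied central weights: W maps hidden -> joint, W^T maps joint -> hidden.
        self.W = nn.Parameter(torch.empty(joint_dim, hidden_dim))
        nn.init.xavier_uniform_(self.W)
        self.b_joint = nn.Parameter(torch.zeros(joint_dim))
        self.b_hidden = nn.Parameter(torch.zeros(hidden_dim))
        # Output layers reconstructing the *other* modality.
        self.dec_b = nn.Linear(hidden_dim, dim_b)
        self.dec_a = nn.Linear(hidden_dim, dim_a)
        self.act = nn.Tanh()

    def forward(self, x_a, x_b):
        # Path 1: modality A -> joint space -> modality B.
        j_a = self.act(F.linear(self.act(self.enc_a(x_a)), self.W, self.b_joint))
        rec_b = self.dec_b(self.act(F.linear(j_a, self.W.t(), self.b_hidden)))
        # Path 2: modality B -> joint space -> modality A (same central weights).
        j_b = self.act(F.linear(self.act(self.enc_b(x_b)), self.W, self.b_joint))
        rec_a = self.dec_a(self.act(F.linear(j_b, self.W.t(), self.b_hidden)))
        # Joint multimodal embedding used to compare video segments.
        return rec_a, rec_b, torch.cat([j_a, j_b], dim=-1)


# Illustrative usage: hypothetical 100-d text features and 4096-d CNN visual
# features; training minimizes the two crossmodal translation errors.
model = BiDNN(dim_a=100, dim_b=4096)
x_text, x_vis = torch.randn(8, 100), torch.randn(8, 4096)
rec_text, rec_vis, joint = model(x_text, x_vis)
loss = F.mse_loss(rec_vis, x_vis) + F.mse_loss(rec_text, x_text)
loss.backward()
```

Tying the central weights forces both translation directions through the same latent space; video segments can then be ranked for hyperlinking by comparing their joint embeddings, for instance with cosine similarity.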
