Neighbourhood Structure Preserving Cross-Modal Embedding for Video Hyperlinking

Video hyperlinking aims to make large video archives more accessible by establishing links between video fragments. The links model the aboutness between fragments, enabling efficient traversal of video content. This paper addresses link construction from the perspective of cross-modal embedding. To this end, a generalized multi-modal auto-encoder is proposed. The encoder learns two embeddings, one from the visual modality and one from the speech modality, and each embedding performs both self-modal and cross-modal translation. Furthermore, to preserve the neighbourhood structure of fragments, which is important for video hyperlinking, the auto-encoder is designed to model the data distribution of fragments in a dataset. Experiments are conducted on the Blip10000 dataset using the anchor fragments provided by the TRECVid Video Hyperlinking (LNK) task in 2016 and 2017. The paper reports empirical insights on several issues in cross-modal learning for video hyperlinking, including the preservation of neighbourhood structure in the embedding, model fine-tuning, and missing modalities.
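The two ingredients named in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the linear encoders and decoders, the decoders shared between the two embeddings, the fixed Gaussian bandwidth `sigma`, and the SNE-style KL divergence used as the neighbourhood term are all assumptions made for illustration, and every function and parameter name below is hypothetical.

```python
import numpy as np


def translation_loss(v, s, enc_v, enc_s, dec_v, dec_s):
    """Self-modal plus cross-modal reconstruction error (assumed linear maps).

    v, s   : visual and speech features, one row per fragment.
    enc_*  : hypothetical encoder matrices, one per modality.
    dec_*  : hypothetical decoder matrices, shared by both embeddings.
    """
    z_v, z_s = v @ enc_v, s @ enc_s  # one embedding per modality
    loss = 0.0
    for z in (z_v, z_s):
        # Each embedding is decoded back to BOTH modalities:
        # one term is self-modal, the other cross-modal.
        loss += np.mean((z @ dec_v - v) ** 2)
        loss += np.mean((z @ dec_s - s) ** 2)
    return float(loss)


def neighbour_probs(x, sigma=1.0):
    """Row-stochastic, SNE-style neighbour distribution over fragments."""
    sq = np.sum(x * x, axis=1)
    # Pairwise squared Euclidean distances, clipped against rounding error.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    logits = -d2 / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)  # a fragment is not its own neighbour
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)


def neighbourhood_loss(x_in, z_emb, eps=1e-12):
    """Mean KL(P || Q) between input-space and embedding-space neighbours."""
    p = neighbour_probs(x_in)
    q = neighbour_probs(z_emb)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))) / len(x_in))
```

Under these assumptions, a training objective would be a weighted sum of `translation_loss` and `neighbourhood_loss` evaluated on each embedding. The weighting and the optimiser are left open here because the abstract does not specify them.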
