Multi-modal tag localization for mobile video search

Given the tremendous growth of mobile videos, video tag localization, which localizes the relevant video clips for an associated semantic tag, is becoming increasingly important to influence users browsing and searching experience. However, most existing approaches adopt and depend to large degree on carefully selected visual features, which are manually designed by experts and do not take multi-modality into consideration. Aiming to take into account complementarity of different modalities, in this paper, we propose a multi-modal tag localization framework by exploiting deep learning to learn visual, auditory, and semantic features of videos for tag localization. Furthermore, we showcase that the framework can be applied to two novel mobile video search applications: (1) automatic time-code-level tags generation and (2) query-dependent video thumbnail selection. Extensive experiments on the public dataset show that the proposed approach achieves promising results, which obtains $$7.6~\%$$7.6% improvement beyond the state-of-the-arts. Finally, the subjective evaluation of usability demonstrates the proposed applications can significantly improve the user’s mobile video search experience.

[1]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[3]  Yongdong Zhang,et al.  Instant Mobile Video Search With Layered Audio-Video Indexing and Progressive Transmission , 2014, IEEE Transactions on Multimedia.

[4]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Weifeng Liu,et al.  Multiview Hessian Regularization for Image Annotation , 2013, IEEE Transactions on Image Processing.

[6]  Adrian Ulges,et al.  Multiple Instance Learning from Weakly Labeled Videos , 2009 .

[7]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Sheng Tang,et al.  Multimodal tag localization based on deep learning , 2015, ICIMCS '15.

[10]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[11]  Nitish Srivastava,et al.  Multimodal learning with deep Boltzmann machines , 2012, J. Mach. Learn. Res..

[12]  Adrian Ulges,et al.  Identifying relevant frames in weakly labeled videos for training concept detectors , 2008, CIVR '08.

[13]  Bin Liu,et al.  Localizing relevant frames in web videos using topic model and relevance filtering , 2013, Machine Vision and Applications.

[14]  Pietro Perona,et al.  Learning object categories from Google's image search , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[15]  Yong Wang,et al.  Coherent image annotation by learning semantic distance , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[17]  Wei-Ta Chu,et al.  Tag suggestion and localization for web videos by bipartite graph matching , 2011, WSM '11.

[18]  Xian-Sheng Hua,et al.  To learn representativeness of video frames , 2005, MULTIMEDIA '05.

[19]  Jiebo Luo,et al.  Towards Extracting Semantically Meaningful Key Frames From Personal Video Clips: From Humans to Computers , 2009, IEEE Transactions on Circuits and Systems for Video Technology.

[20]  Yuan Yan Tang,et al.  Multiview Hessian discriminative sparse coding for image annotation , 2013, Comput. Vis. Image Underst..

[21]  Alberto Del Bimbo,et al.  Tag suggestion and localization in user-generated videos based on social knowledge , 2010, WSM@MM.

[22]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[23]  Haojie Li,et al.  DUT-WEBV: A Benchmark Dataset for Performance Evaluation of Tag Localization for Web Video , 2013, MMM.

[24]  Tao Mei,et al.  Jointly Modeling Embedding and Translation to Bridge Video and Language , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[26]  Alberto Del Bimbo,et al.  A data-driven approach for tag refinement and localization in web videos , 2015, Comput. Vis. Image Underst..

[27]  Nitish Srivastava,et al.  Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res..

[28]  Meng Wang,et al.  ShotTagger: tag location for internet videos , 2011, ICMR.

[29]  Camille Couprie,et al.  Learning Hierarchical Features for Scene Labeling , 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Dacheng Tao,et al.  Multi-View Intact Space Learning , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Dacheng Tao,et al.  Large-Margin Multi-ViewInformation Bottleneck , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[33]  Tao Mei,et al.  Contextual Video Recommendation by Multimodal Relevance and User Feedback , 2011, TOIS.

[34]  Yongdong Zhang,et al.  Multi-task deep visual-semantic embedding for video thumbnail selection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Xuelong Li,et al.  Hessian Regularized Support Vector Machines for Mobile Image Annotation on the Cloud , 2013, IEEE Transactions on Multimedia.

[36]  Trevor Darrell,et al.  DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition , 2013, ICML.