Multi-task deep visual-semantic embedding for video thumbnail selection

Given the tremendous growth of online video, the video thumbnail, as the most common visualization of video content, increasingly influences users' browsing and searching experience. However, conventional methods for video thumbnail selection often fail to produce satisfying results because they ignore the side semantic information (e.g., title, description, and query) associated with the video. As a result, the selected thumbnail does not always represent the video's semantics, and the click-through rate suffers even when the retrieved videos are relevant. In this paper, we develop a multi-task deep visual-semantic embedding model that automatically selects query-dependent video thumbnails according to both visual content and side information. Unlike most existing methods, the proposed approach uses the deep visual-semantic embedding model to directly compute the similarity between a query and candidate thumbnails by mapping both into a common latent semantic space, where even unseen query-thumbnail pairs can be correctly matched. In particular, we train the embedding model on large-scale, freely accessible click-through video and image data, employing a multi-task learning strategy to holistically exploit the query-thumbnail relevance in these two highly related datasets. Finally, a thumbnail is selected by fusing its representativeness and query-relevance scores. Evaluations on a dataset of 1,000 query-thumbnail pairs, labeled by 191 workers on Amazon Mechanical Turk, demonstrate the effectiveness of the proposed method.
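The selection step described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: the projection matrices, dimensions, and the linear fusion weight `alpha` are all hypothetical stand-ins for the learned embedding model, and random features are used in place of real text and visual descriptors. It shows only the general idea of mapping a query and candidate frames into a shared latent space, scoring cosine relevance there, and fusing it with a representativeness score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: text features and visual features are projected
# into a shared latent space of size d_latent by learned linear maps.
d_text, d_vis, d_latent = 300, 4096, 256
W_text = rng.normal(scale=0.01, size=(d_latent, d_text))  # text-side projection
W_vis = rng.normal(scale=0.01, size=(d_latent, d_vis))    # visual-side projection

def embed(x, W):
    """Project a feature vector into the shared space and L2-normalize it."""
    z = W @ x
    return z / np.linalg.norm(z)

def select_thumbnail(query_feat, frame_feats, rep_scores, alpha=0.5):
    """Fuse query relevance (cosine similarity in the shared space) with a
    per-frame representativeness score; return the index of the best frame."""
    q = embed(query_feat, W_text)
    relevance = np.array([q @ embed(f, W_vis) for f in frame_feats])
    fused = alpha * relevance + (1 - alpha) * np.asarray(rep_scores)
    return int(np.argmax(fused))

# Toy usage: one query against three candidate frames with given
# representativeness scores.
query = rng.normal(size=d_text)
frames = [rng.normal(size=d_vis) for _ in range(3)]
best = select_thumbnail(query, frames, rep_scores=[0.2, 0.9, 0.4])
print(best)
```

In the paper, the projections are trained with a multi-task objective on click-through video and image data rather than fixed random matrices; the sketch only mirrors the inference-time fusion of the two scores.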
