Exploiting Web Images for Video Highlight Detection With Triplet Deep Ranking

Highlight detection in videos has been widely studied because of the rapid growth of video content. However, most existing approaches to highlight detection, whether based on handcrafted features or on deep learning, rely heavily on human-curated training data, which is very expensive to obtain and thus hinders scaling to large datasets and unlabeled video categories. We observe that widely available Web images can serve as weak supervision for highlight detection. For example, the top-ranked images returned by a search engine for the query “skiing” are likely to contain many positive samples of “skiing” highlights. Motivated by this observation, we propose a novel triplet deep ranking approach to video highlight detection that uses Web images as weak supervision. The approach encodes the relative preference of highlight scores among highlight frames, non-highlight frames, and Web images as triplet ranking constraints. To deal with noisy Web images, it iteratively trains two interdependent deep models (a triplet highlight model and a pairwise noise model) in a single framework. Because both models are trained on relative preferences rather than category labels, they generalize regardless of the categories of the training data. Our approach is therefore fully category independent while exploiting weakly supervised Web images. We evaluate our approach on two challenging datasets and achieve impressive results compared with state-of-the-art pairwise ranking support vector machines, a robust recurrent autoencoder, and spatial deep convolutional neural networks. We also verify empirically, through cross-dataset evaluation, that our category-independent model generalizes fairly well even when the two datasets do not share exactly the same categories.
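
As a rough illustration of the triplet ranking constraints described above, the following minimal Python sketch computes a hinge-style ranking loss over one (highlight frame, Web image, non-highlight frame) triplet of scalar highlight scores. It is not the paper's formulation: the ordering (both the highlight frame and the Web image should score above the non-highlight frame), the `margin` value, and the `web_weight` parameter (a hypothetical stand-in for how much a possibly noisy Web image is trusted) are all assumptions made for this example.

```python
def hinge_rank(higher: float, lower: float, margin: float = 1.0) -> float:
    """Hinge penalty that is zero once `higher` beats `lower` by `margin`."""
    return max(0.0, margin - (higher - lower))


def triplet_ranking_loss(
    highlight: float,
    web: float,
    non_highlight: float,
    margin: float = 1.0,
    web_weight: float = 1.0,
) -> float:
    """Ranking loss for one triplet of highlight scores.

    Assumed ordering (not taken from the paper): the highlight frame and
    the Web image should both outscore the non-highlight frame.
    `web_weight` is a hypothetical factor that down-weights the Web-image
    term when the image is suspected to be noisy.
    """
    frame_term = hinge_rank(highlight, non_highlight, margin)
    web_term = hinge_rank(web, non_highlight, margin)
    return frame_term + web_weight * web_term


if __name__ == "__main__":
    # Both constraints are satisfied by more than the margin, so no penalty.
    print(triplet_ranking_loss(2.0, 1.5, 0.0))
    # Constraints are violated; halving web_weight reduces the Web-image term.
    print(triplet_ranking_loss(0.2, 0.1, 0.0, web_weight=0.5))
```

In a trained model the three scores would come from a shared deep network applied to each input; here they are plain floats so that the constraint itself stays visible.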
