A General Framework for Edited Video and Raw Video Summarization

In this paper, we build a general summarization framework for both edited and raw videos. Our work consists of three parts. 1) Four models are designed to capture the properties of a good video summary: it contains important people and objects (importance), it is representative of the video content (representativeness), it contains no similar key-shots (diversity), and its storyline is smooth (storyness). These models are applicable to both edited videos and raw videos. 2) A comprehensive score function is built as a weighted combination of the four models. The weights of the four models in the score function, denoted as property-weights, are learned in a supervised manner, separately for edited videos and raw videos. 3) The training set is constructed from both edited videos and raw videos to compensate for the lack of training data. In particular, each training video is equipped with a pair of mixing-coefficients, which reduce the structural disorder that the rough mixture introduces into the training set. We test our framework on three data sets, covering edited videos, short raw videos, and long raw videos. Experimental results verify the effectiveness of the proposed framework.
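
To make the weighted score function concrete, the following is a minimal sketch of how a summary could be scored as a weighted combination of property models, and how key-shots could then be picked greedily under that score. The abstract does not specify the form of the models, the selection procedure, or any weight values, so everything here is an assumption for illustration: the names `summary_score` and `greedy_select`, the two toy property models (the other two properties are omitted for brevity), the greedy strategy, and the numbers in `edited_weights` and `raw_weights`.

```python
from typing import Callable, Dict, List, Sequence

# A property model maps a candidate summary (shot indices) to a score.
# These are toy stand-ins, not the models proposed in the paper.
PropertyModel = Callable[[Sequence[int]], float]

# Toy per-shot data (assumed values for illustration only).
shot_importance = [0.9, 0.8, 0.1, 0.7]   # hypothetical importance per shot
shot_cluster = [0, 0, 1, 2]              # hypothetical visual cluster per shot

models: Dict[str, PropertyModel] = {
    # Importance: total importance of the selected shots.
    "importance": lambda s: sum(shot_importance[i] for i in s),
    # Diversity: number of distinct visual clusters covered.
    "diversity": lambda s: float(len({shot_cluster[i] for i in s})),
}

# Hypothetical property-weights, one set per video type. In the paper
# these are learned with supervision; here they are fixed by hand.
edited_weights = {"importance": 1.0, "diversity": 1.0}
raw_weights = {"importance": 1.0, "diversity": 0.0}


def summary_score(summary: Sequence[int],
                  models: Dict[str, PropertyModel],
                  weights: Dict[str, float]) -> float:
    """Weighted combination of the property models for one summary."""
    return sum(weights[name] * model(summary) for name, model in models.items())


def greedy_select(num_shots: int, budget: int,
                  models: Dict[str, PropertyModel],
                  weights: Dict[str, float]) -> List[int]:
    """Repeatedly add the shot that yields the highest summary score."""
    selected: List[int] = []
    remaining = set(range(num_shots))
    while remaining and len(selected) < budget:
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining),
                   key=lambda i: summary_score(selected + [i], models, weights))
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)


if __name__ == "__main__":
    print(greedy_select(4, 2, models, edited_weights))
    print(greedy_select(4, 2, models, raw_weights))
```

Switching between the two weight dictionaries changes which shots are selected from the same video, which is the role the abstract assigns to separate property-weights for edited and raw videos.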
