A Bag-of-Importance Model With Locality-Constrained Coding Based Feature Learning for Video Summarization

Video summarization helps users obtain quick comprehension of video content. Recently, some studies have utilized local features to represent each video frame and formulate video summarization as a coverage problem of local features. However, the importance of individual local features has not been exploited. In this paper, we propose a novel Bag-of-Importance (BoI) model for static video summarization by identifying the frames with important local features as keyframes, which is one of the first studies formulating video summarization at local feature level, instead of at global feature level. That is, by representing each frame with local features, a video is characterized with a bag of local features weighted with individual importance scores and the frames with more important local features are more representative, where the representativeness of each frame is the aggregation of the weighted importance of the local features contained in the frame. In addition, we propose to learn a transformation from a raw local feature to a more powerful sparse nonlinear representation for deriving the importance score of each local feature, rather than directly utilize the hand-crafted visual features like most of the existing approaches. Specifically, we first employ locality-constrained linear coding (LCC) to project each local feature into a sparse transformed space. LCC is able to take advantage of the manifold geometric structure of the high dimensional feature space and form the manifold of the low dimensional transformed space with the coordinates of a set of anchor points. Then we calculate the l2 norm of each anchor point as the importance score of each local feature which is projected to the anchor point. Finally, the distribution of the importance scores of all the local features in a video is obtained as the BoI representation of the video. We further differentiate the importance of local features with a spatial weighting template by taking the perceptual difference among spatial regions of a frame into account. As a result, our proposed video summarization approach is able to exploit both the inter-frame and intra-frame properties of feature representations and identify keyframes capturing both the dominant content and discriminative details within a video. Experimental results on three video datasets across various genres demonstrate that the proposed approach clearly outperforms several state-of-the-art methods.

[1]  Shiyang Lu,et al.  Keypoint-Based Keyframe Selection , 2013, IEEE Transactions on Circuits and Systems for Video Technology.

[2]  Jiebo Luo,et al.  Towards Extracting Semantically Meaningful Key Frames From Personal Video Clips: From Humans to Computers , 2009, IEEE Transactions on Circuits and Systems for Video Technology.

[3]  Jianping Fan,et al.  Image collection summarization via dictionary learning for sparse representation , 2013, Pattern Recognit..

[4]  Rajat Raina,et al.  Efficient sparse coding algorithms , 2006, NIPS.

[5]  Lie Lu,et al.  A generic framework of user attention model and its application in video summarization , 2005, IEEE Trans. Multim..

[6]  Yihong Gong,et al.  Locality-constrained Linear Coding for image classification , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[7]  Athman Bouguettaya,et al.  An Efficient Near-Duplicate Video Shot Detection Method Using Shot-Based Interest Points , 2009, IEEE Transactions on Multimedia.

[8]  Harry W. Agius,et al.  Video summarisation: A conceptual framework and survey of the state of the art , 2008, J. Vis. Commun. Image Represent..

[9]  Jhing-Fa Wang,et al.  A Novel Video Summarization Based on Mining the Story-Structure and Semantic Relations Among Concept Entities , 2009, IEEE Transactions on Multimedia.

[10]  Tao Chen,et al.  Orthogonal Muxing Frame Compatible Full Resolution technology for multi-resolution frame-compatible stereo coding , 2013, 2013 IEEE International Conference on Multimedia and Expo (ICME).

[11]  Chong-Wah Ngo,et al.  Scale-Rotation Invariant Pattern Entropy for Keypoint-Based Near-Duplicate Detection , 2009, IEEE Transactions on Image Processing.

[12]  Matthijs C. Dorst Distinctive Image Features from Scale-Invariant Keypoints , 2011 .

[13]  Yong Jae Lee,et al.  Discovering important people and objects for egocentric video summarization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Ba Tu Truong,et al.  Video abstraction: A systematic review and classification , 2007, TOMCCAP.

[15]  Junsong Yuan,et al.  Thematic Saliency Detection Using Spatial-Temporal Context , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[16]  George Economou,et al.  Combining graph connectivity & dominant set clustering for video summarization , 2009, Multimedia Tools and Applications.

[17]  Marco Pellegrini,et al.  STIMO: STIll and MOving video storyboard for the web scenario , 2009, Multimedia Tools and Applications.

[18]  Jiebo Luo,et al.  Towards Scalable Summarization of Consumer Videos Via Sparse Dictionary Selection , 2012, IEEE Transactions on Multimedia.

[19]  Chiou-Ting Hsu,et al.  Fusion of audio and motion information on HMM-based highlight extraction for baseball games , 2006, IEEE Transactions on Multimedia.

[20]  Manfred del Fabro,et al.  Global vs . Local Feature in Video Summarization : Experimental Results , 2009 .

[21]  Yelena Yesha,et al.  Keyframe-based video summarization using Delaunay clustering , 2006, International Journal on Digital Libraries.

[22]  Aggelos K. Katsaggelos,et al.  MINMAX optimal video summarization , 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[23]  Tao Mei,et al.  A bag-of-importance model for video summarization , 2013, 2013 IEEE International Conference on Multimedia and Expo Workshops (ICMEW).

[24]  M. Yuan,et al.  Model selection and estimation in regression with grouped variables , 2006 .

[25]  Yihong Gong,et al.  Nonlinear Learning using Local Coordinate Coding , 2009, NIPS.

[26]  Chong-Wah Ngo,et al.  Video summarization and scene detection by graph modeling , 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[27]  Xin Liu,et al.  Video summarization using singular value decomposition , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[28]  Yueting Zhuang,et al.  Adaptive key frame extraction using unsupervised clustering , 1998, Proceedings 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269).

[29]  Liqing Zhang,et al.  Saliency Detection: A Spectral Residual Approach , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Stan Szpakowicz,et al.  Beyond Accuracy, F-Score and ROC: A Family of Discriminant Measures for Performance Evaluation , 2006, Australian Conference on Artificial Intelligence.

[31]  Yong Jae Lee,et al.  Key-segments for video object segmentation , 2011, 2011 International Conference on Computer Vision.

[32]  Marcel van Herk A fast algorithm for local minimum and maximum filters on rectangular and octagonal kernels , 1992, Pattern Recognit. Lett..

[33]  Cordelia Schmid,et al.  A Performance Evaluation of Local Descriptors , 2005, IEEE Trans. Pattern Anal. Mach. Intell..

[34]  Shaohui Mei,et al.  Video Summarization with Global and Local Features , 2012, 2012 IEEE International Conference on Multimedia and Expo Workshops.

[35]  Georgios Tziritas,et al.  Equivalent Key Frames Selection Based on Iso-Content Principles , 2009, IEEE Transactions on Circuits and Systems for Video Technology.

[36]  Vincent Lepetit,et al.  Are sparse representations really relevant for image classification? , 2011, CVPR 2011.

[37]  Arnaldo de Albuquerque Araújo,et al.  VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method , 2011, Pattern Recognit. Lett..

[38]  David S. Doermann,et al.  Video summarization by curve simplification , 1998, MULTIMEDIA '98.