Detecting shot boundary with sparse coding for video summarization

Abstract

Keyframe selection is a common way to summarize video content. However, delimiting shot boundaries in order to extract a representative keyframe from each shot is not trivial, because most shot boundary detection techniques are heuristic and sensitive to the type of video transition. This paper proposes a new shot boundary detection algorithm that learns a dictionary from the given video using sparse coding and updates the atoms of the dictionary, following the principle that frames from a different shot cannot be reconstructed well by the dictionary learned from the current shot. Technically, our algorithm performs the learning by simultaneously minimizing the reconstruction loss, constraining the sparsity of the reconstruction matrix, and preserving the structure across patches and frames. Once the shot boundaries are determined, one representative keyframe is selected from each shot, and a video summary is then constructed by concatenating the representative keyframes through a post-processing step. On two standard video datasets covering various genres, namely the VSUMM and YouTube datasets, our method proves effective for video summarization and outperforms several state-of-the-art techniques.
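The abstract names the three ingredients of the learning objective but not their exact form. As a hedged sketch, assuming a Frobenius-norm reconstruction loss, an ℓ1 sparsity penalty, and a graph-Laplacian term for structure preservation (in the style of graph-regularized sparse coding), one plausible objective and boundary rule could be written as below. The weights λ and β, the affinity matrix W, and the threshold τ are our assumptions, not the paper's notation.

```latex
\begin{aligned}
&\min_{\mathbf{D},\,\mathbf{A}}\;
  \tfrac{1}{2}\lVert \mathbf{X} - \mathbf{D}\mathbf{A} \rVert_F^2
  + \lambda \lVert \mathbf{A} \rVert_1
  + \tfrac{\beta}{2}\operatorname{tr}\!\left(\mathbf{A}\mathbf{L}\mathbf{A}^{\top}\right)
  \quad \text{s.t. } \lVert \mathbf{d}_k \rVert_2 \le 1,\; k = 1,\dots,K, \\
&\operatorname{tr}\!\left(\mathbf{A}\mathbf{L}\mathbf{A}^{\top}\right)
  = \tfrac{1}{2}\sum_{i,j} W_{ij}\,\lVert \mathbf{a}_i - \mathbf{a}_j \rVert_2^2,
  \qquad \mathbf{L} = \mathbf{S} - \mathbf{W},\quad S_{ii} = \textstyle\sum_j W_{ij}, \\
&e_t = \frac{\lVert \mathbf{x}_t - \mathbf{D}\mathbf{a}_t \rVert_2}{\lVert \mathbf{x}_t \rVert_2} > \tau
  \;\Longrightarrow\; \text{frame } t \text{ starts a new shot.}
\end{aligned}
```

Here the columns x_i of X are patch or frame features, d_k are the K dictionary atoms, a_i are the corresponding columns of the reconstruction matrix A, and W is a symmetric, nonnegative affinity between patches or frames. The identity on the second line shows why the trace term preserves structure: it stays small only when patches or frames with a large affinity W_ij receive similar sparse codes.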
