Towards Scalable Summarization of Consumer Videos Via Sparse Dictionary Selection

The rapid growth of consumer videos calls for an effective and efficient content summarization method that gives users a friendly way to manage and browse large amounts of video data. Most previous methods focus on sports and news videos; summarizing personal videos is more challenging because their content is unconstrained and they lack any pre-imposed structure. We formulate video summarization as a novel dictionary selection problem using sparsity consistency, in which a dictionary of key frames is selected such that the original video can be best reconstructed from this representative dictionary. We introduce an efficient global optimization algorithm for the dictionary selection model with a convergence rate of O(1/K^2), where K is the iteration counter, in contrast to the O(1/√K) rate of traditional sub-gradient descent methods. Our method provides a scalable solution for both key frame extraction and video skim generation, because one can select an arbitrary number of key frames to represent the original video. Experiments on a human-labeled benchmark dataset and comparisons with state-of-the-art methods demonstrate the advantages of our algorithm.
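
As an illustration of the dictionary selection idea, the following is a minimal sketch and not necessarily the exact model used in the paper. It assumes the objective is the row-sparse self-reconstruction problem min_X 0.5*||B - BX||_F^2 + lam * sum_i ||X_i||_2, where the columns of B are per-frame feature vectors, and it solves this problem with a generic FISTA-style accelerated proximal gradient scheme, a class of methods known to reach an O(1/K^2) rate on such composite objectives. The names `select_key_frames`, `row_soft_threshold`, `lam`, and `num_keys` are hypothetical.

```python
import numpy as np

def row_soft_threshold(Z, tau):
    """Proximal operator of tau * sum of row-wise L2 norms (group shrinkage)."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * Z

def select_key_frames(B, lam, num_keys, iters=200):
    """Rank frames by the row norms of a row-sparse reconstruction matrix X.

    B: (d, n) array whose columns are frame features (assumed layout).
    Returns the indices of the num_keys highest-scoring frames and all scores.
    """
    n = B.shape[1]
    G = B.T @ B                                  # Gram matrix of the frames
    L = max(np.linalg.norm(B, 2) ** 2, 1e-12)    # Lipschitz constant of the gradient
    X = np.zeros((n, n))
    Y = X.copy()
    t = 1.0
    for _ in range(iters):
        grad = G @ Y - G                         # gradient of 0.5 * ||B - B Y||_F^2
        X_new = row_soft_threshold(Y - grad / L, lam / L)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = X_new + ((t - 1.0) / t_new) * (X_new - X)   # momentum (extrapolation) step
        X, t = X_new, t_new
    scores = np.linalg.norm(X, axis=1)           # a nonzero row means the frame is used
    keys = np.argsort(-scores)[:num_keys]
    return keys, scores

# Usage on synthetic data: 30 frames with 16-dimensional features.
rng = np.random.default_rng(0)
B_demo = rng.standard_normal((16, 30))
demo_keys, demo_scores = select_key_frames(B_demo, lam=1.0, num_keys=5)
```

Because every frame receives a score, the same ranking can be cut at any length, which is one way to read the claim that an arbitrary number of key frames can be selected.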
