Keyframe-based Video Summarization with Human in the Loop

In this work, we focus on the popular keyframe-based approach for video summarization. Keyframes represent important and diverse content of an input video and a summary is generated by temporally expanding the keyframes to key shots which are merged to a continuous dynamic video summary. In our approach, keyframes are selected from scenes that represent semantically similar content. For scene detection, we propose a simple yet effective dynamic extension of a video Bag-of-Words (BoW) method which provides over segmentation (high recall) for keyframe selection. For keyframe selection, we investigate two effective approaches: local region descriptors (visual content) and optical flow descriptors (motion content). We provide several interesting findings. 1) While scenes (visually similar content) can be effectively detected by region descriptors, optical flow (motion changes) provides better keyframes. 2) However, the suitable parameters of the motion descriptor based keyframe selection vary from one video to another and average performances remain low. To avoid more complex processing, we introduce a human-in-the-loop step where user selects keyframes produced by the three best methods. 3) Our human assisted and learning-free method achieves superior accuracy to learning-based methods and for many videos is on par with average human accuracy.

[1]  Junsong Yuan,et al.  From Keyframes to Key Objects: Video Summarization by Representative Object Proposal Selection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[3]  José María Martínez Sanchez,et al.  Scalable Comic-Like Video Summaries and Layout Disturbance , 2012, IEEE Transactions on Multimedia.

[4]  Georges Quénot,et al.  TRECVID 2015 - An Overview of the Goals, Tasks, Data, Evaluation Mechanisms and Metrics , 2011, TRECVID.

[5]  Luc Van Gool,et al.  Creating Summaries from User Videos , 2014, ECCV.

[6]  C. Schmid,et al.  Category-Specific Video Summarization , 2014, ECCV.

[7]  Luc Van Gool,et al.  Video summarization by learning submodular mixtures of objectives , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Lie Lu,et al.  A generic framework of user attention model and its application in video summarization , 2005, IEEE Trans. Multim..

[9]  Alexander C. Loui,et al.  Finding structure in home videos by probabilistic hierarchical clustering , 2003, IEEE Trans. Circuits Syst. Video Technol..

[10]  Bo Han,et al.  Video scene segmentation using a novel boundary evaluation criterion and dynamic programming , 2011, 2011 IEEE International Conference on Multimedia and Expo.

[11]  Shaogang Gong,et al.  Video Synopsis by Heterogeneous Multi-source Correlation , 2013, 2013 IEEE International Conference on Computer Vision.

[12]  Paul Over,et al.  Video shot boundary detection: Seven years of TRECVid activity , 2010, Comput. Vis. Image Underst..

[13]  Hui Cheng,et al.  Evaluation of low-level features and their combinations for complex event detection in open source videos , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Bin Zhao,et al.  Quasi Real-Time Summarization for Consumer Videos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Yang Yang,et al.  Multimedia Summarization for Social Events in Microblog Stream , 2015, IEEE Transactions on Multimedia.

[16]  Kristen Grauman,et al.  Diverse Sequential Subset Selection for Supervised Video Summarization , 2014, NIPS.

[17]  Tao Mei,et al.  A Bag-of-Importance Model With Locality-Constrained Coding Based Feature Learning for Video Summarization , 2014, IEEE Transactions on Multimedia.

[18]  Joni-Kristian Kämäräinen,et al.  Video Shot Boundary Detection using Visual Bag-of-Words , 2013, VISAPP.

[19]  Ba Tu Truong,et al.  Video abstraction: A systematic review and classification , 2007, TOMCCAP.

[20]  Bohyung Han,et al.  Personalized video summarization with human in the loop , 2011, 2011 IEEE Workshop on Applications of Computer Vision (WACV).

[21]  Ke Zhang,et al.  Summary Transfer: Exemplar-Based Subset Selection for Video Summarization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Tao Mei,et al.  EMS: Energy Minimization Based Video Scene Segmentation , 2007, 2007 IEEE International Conference on Multimedia and Expo.

[23]  Jean-Marc Odobez,et al.  On Spectral Methods and the Structuring of Home Videos , 2002 .

[24]  Yongdong Zhang,et al.  Multi-task deep visual-semantic embedding for video thumbnail selection , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Shiyang Lu,et al.  Keypoint-Based Keyframe Selection , 2013, IEEE Transactions on Circuits and Systems for Video Technology.

[26]  Meng Wang,et al.  Movie2Comics: a feast of multimedia artwork , 2010, ACM Multimedia.

[27]  Meng Wang,et al.  Event Driven Web Video Summarization by Tag Localization and Key-Shot Identification , 2012, IEEE Transactions on Multimedia.

[28]  Shih-Fu Chang,et al.  Consumer video understanding: a benchmark database and an evaluation of human and machine performance , 2011, ICMR.

[29]  László Böszörményi,et al.  State-of-the-art and future challenges in video scene detection: a survey , 2013, Multimedia Systems.

[30]  Ioannis Pitas,et al.  Enhanced Eigen-Audioframes for Audiovisual Scene Change Detection , 2007, IEEE Transactions on Multimedia.

[31]  Yael Pritch,et al.  Making a Long Video Short: Dynamic Video Synopsis , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[32]  Rama Chellappa,et al.  Video Précis: Highlighting Diverse Aspects of Videos , 2010, IEEE Transactions on Multimedia.

[33]  Jiebo Luo,et al.  Kodak consumer video benchmark data set : concept definition and annotation * * , 2008 .

[34]  Chong-Wah Ngo,et al.  Video summarization and scene detection by graph modeling , 2005, IEEE Transactions on Circuits and Systems for Video Technology.

[35]  Arnaldo de Albuquerque Araújo,et al.  VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method , 2011, Pattern Recognit. Lett..

[36]  Hongyang Chao,et al.  Discovering Video Shot Categories by Unsupervised Stochastic Graph Partition , 2013, IEEE Transactions on Multimedia.

[37]  Sung Wook Baik,et al.  Efficient visual attention based framework for extracting key frames from videos , 2013, Signal Process. Image Commun..

[38]  Gunnar Farnebäck,et al.  Two-Frame Motion Estimation Based on Polynomial Expansion , 2003, SCIA.

[39]  Jiebo Luo,et al.  Towards Extracting Semantically Meaningful Key Frames From Personal Video Clips: From Humans to Computers , 2009, IEEE Transactions on Circuits and Systems for Video Technology.

[40]  R. Vidal,et al.  Histograms of oriented optical flow and Binet-Cauchy kernels on nonlinear dynamical systems for the recognition of human actions , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Yale Song,et al.  Video2GIF: Automatic Generation of Animated GIFs from Video , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Mubarak Shah,et al.  Video scene segmentation using Markov chain Monte Carlo , 2006, IEEE Transactions on Multimedia.