A Unified Multi-Faceted Video Summarization System

This paper addresses automatic summarization and search in visual data comprising videos, live streams and image collections in a unified manner. In particular, we propose a framework for multi-faceted summarization which extracts key-frames (image summaries), skims (video summaries) and entity summaries (summarization at the level of entities such as objects, scenes, humans and faces in the video). The user can view these either as extractive summaries or as query-focused summaries. Our approach first pre-processes the video or image collection once to extract all important visual features, after which an interactive mechanism lets the user summarize the video according to their choice. We investigate several diversity, coverage and representation models for all these problems, and argue for the utility of these different models depending on the application. While most prior work on submodular summarization has focused on combining several models and learning weighted mixtures, we focus on the explainability of the different diversity, coverage and representation models and on their scalability. Most importantly, we also show that we can summarize hours of video data in a few seconds, and our system allows the user to generate summaries of various lengths and types interactively on the fly.
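To make the notion of a representation model concrete, the following is a minimal sketch of greedy key-frame selection under a facility-location objective, a standard submodular function in the summarization literature. It is an illustration only and not necessarily the formulation used in this paper: the per-frame feature vectors, the cosine similarity, and the names `greedy_summary` and `facility_location` are all assumptions introduced here.

```python
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


def similarity_matrix(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise non-negative similarities between frames (assumed kernel: clipped cosine)."""
    n = len(features)
    return [[max(0.0, cosine(features[i], features[j])) for j in range(n)]
            for i in range(n)]


def facility_location(selected: Sequence[int], sim: List[List[float]]) -> float:
    """f(S) = sum_i max_{j in S} sim[i][j]: how well S represents every frame."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_summary(features: Sequence[Sequence[float]], budget: int) -> List[int]:
    """Pick up to `budget` frame indices by repeatedly adding the largest marginal gain."""
    n = len(features)
    sim = similarity_matrix(features)
    selected: List[int] = []
    # best_cover[i] = similarity of frame i to its closest already-selected frame
    best_cover = [0.0] * n
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain of adding frame j to the current summary
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


if __name__ == "__main__":
    # Toy data: two visually distinct groups of frames (hypothetical 2-D features).
    frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(greedy_summary(frames, budget=2))
```

Because the similarity matrix is computed once and each selection step only updates `best_cover`, summaries of different lengths can be regenerated cheaply from the same pre-processed features, which is in the spirit of the pre-process-once, summarize-interactively design described above. For a monotone submodular objective under a cardinality constraint, this greedy procedure carries the classical (1 - 1/e) approximation guarantee.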
