Query-Focused Video Summarization: Dataset, Evaluation, and a Memory Network Based Approach

Recent years have witnessed a resurgence of interest in video summarization. However, one of the main obstacles to the research on video summarization is the user subjectivity — users have various preferences over the summaries. The subjectiveness causes at least two problems. First, no single video summarizer fits all users unless it interacts with and adapts to the individual users. Second, it is very challenging to evaluate the performance of a video summarizer. To tackle the first problem, we explore the recently proposed query-focused video summarization which introduces user preferences in the form of text queries about the video into the summarization process. We propose a memory network parameterized sequential determinantal point process in order to attend the user query onto different video frames and shots. To address the second challenge, we contend that a good evaluation metric for video summarization should focus on the semantic information that humans can perceive rather than the visual features or temporal overlaps. To this end, we collect dense per-video-shot concept annotations, compile a new dataset, and suggest an efficient evaluation method defined upon the concept annotations. We conduct extensive experiments contrasting our video summarizer to existing ones and present detailed analyses about the dataset and the new evaluation method.

[1]  C. Schmid,et al.  Category-Specific Video Summarization , 2014, ECCV.

[2]  Jason Weston,et al.  End-To-End Memory Networks , 2015, NIPS.

[3]  Yale Song,et al.  Video co-summarization: Video summarization by visual co-occurrence , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Yong Jae Lee,et al.  Discovering important people and objects for egocentric video summarization , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[5]  Chin-Yew Lin,et al.  ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.

[6]  Ben Taskar,et al.  Learning Determinantal Point Processes , 2011, UAI.

[7]  David Salesin,et al.  Schematic storyboarding for video visualization and editing , 2006, SIGGRAPH '06.

[8]  James M. Rehg,et al.  Gaze-enabled egocentric video summarization via constrained submodular maximization , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Ulrich Paquet,et al.  Low-Rank Factorization of Determinantal Point Processes , 2016, AAAI.

[10]  Rongrong Ji,et al.  SentiBank: large-scale ontology and classifiers for detecting sentiment and emotions in visual content , 2013, ACM Multimedia.

[11]  Yael Pritch,et al.  Making a Long Video Short: Dynamic Video Synopsis , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[12]  Chih-Jen Lin,et al.  Large-Scale Video Summarization Using Web-Image Priors , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Ben Taskar,et al.  Expectation-Maximization for Learning Determinantal Point Processes , 2014, NIPS.

[14]  Suvrit Sra,et al.  Kronecker Determinantal Point Processes , 2016, NIPS.

[15]  Yejin Choi,et al.  From Large Scale Image Categorization to Entry-Level Categories , 2013, 2013 IEEE International Conference on Computer Vision.

[16]  Jason Weston,et al.  Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks , 2015, ICLR.

[17]  Patrick Lambert,et al.  Video summarization from spatio-temporal features , 2008, TVS '08.

[18]  Wayne H. Wolf,et al.  Key frame selection by motion analysis , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[19]  Luc Van Gool,et al.  Creating Summaries from User Videos , 2014, ECCV.

[20]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[21]  Yong Jae Lee,et al.  Predicting Important Objects for Egocentric Video Summarization , 2015, International Journal of Computer Vision.

[22]  Yale Song,et al.  TVSum: Summarizing web videos using titles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Krzysztof Choromanski,et al.  Notes on using Determinantal Point Processes for Clustering with Applications to Text Clustering , 2014, ArXiv.

[24]  Matti Pietikäinen,et al.  Multiresolution Gray-Scale and Rotation Invariant Texture Classification with Local Binary Patterns , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[25]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[26]  Ryan P. Adams,et al.  Priors for Diversity in Generative Latent Variable Models , 2012, NIPS.

[27]  Richard Socher,et al.  Dynamic Memory Networks for Visual and Textual Question Answering , 2016, ICML.

[28]  Jason Weston,et al.  Memory Networks , 2014, ICLR.

[29]  Xinlei Chen,et al.  Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.

[30]  Kristen Grauman,et al.  Diverse Sequential Subset Selection for Supervised Video Summarization , 2014, NIPS.

[31]  Shih-Fu Chang,et al.  Designing Category-Level Attributes for Discriminative Visual Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Ben Taskar,et al.  Determinantal Point Processes for Machine Learning , 2012, Found. Trends Mach. Learn..

[33]  Kristen Grauman,et al.  Story-Driven Summarization for Egocentric Video , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Bin Zhao,et al.  Quasi Real-Time Summarization for Consumer Videos , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35]  Ke Zhang,et al.  Video Summarization with Long Short-Term Memory , 2016, ECCV.

[36]  Tao Mei,et al.  Highlight Detection with Pairwise Deep Ranking for First-Person Video Summarization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  John R. Kender,et al.  Optimization Algorithms for the Selection of Key Frame Sequences of Variable Length , 2002, ECCV.

[38]  Luc Van Gool,et al.  Video summarization by learning submodular mixtures of objectives , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Suvrit Sra,et al.  Fast DPP Sampling for Nystrom with Application to Kernel Methods , 2016, ICML.

[40]  Eric P. Xing,et al.  Joint Summarization of Large-Scale Collections of Web Images and Videos for Storyline Reconstruction , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Suvrit Sra,et al.  Fixed-point algorithms for learning determinantal point processes , 2015, ICML.

[42]  Alex Kulesza,et al.  Diversifying Sparsity Using Variational Determinantal Point Processes , 2014, ArXiv.

[43]  Mubarak Shah,et al.  Query-Focused Extractive Video Summarization , 2016, ECCV.

[44]  Ke Zhang,et al.  Summary Transfer: Exemplar-Based Subset Selection for Video Summarization , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45]  Junseok Kwon,et al.  A unified framework for event summarization and rare event detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[46]  Kristen Grauman,et al.  Large-Margin Determinantal Point Processes , 2014, UAI.

[47]  Arnaldo de Albuquerque Araújo,et al.  VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method , 2011, Pattern Recognit. Lett..

[48]  Ben Taskar,et al.  k-DPPs: Fixed-Size Determinantal Point Processes , 2011, ICML.

[49]  Kristen Grauman,et al.  Intentional Photos from an Unintentional Photographer: Detecting Snap Points in Egocentric Video with a Web Photo Prior , 2014, Mobile Cloud Visual Media Computing.

[50]  Margaret Mitchell,et al.  VQA: Visual Question Answering , 2015, International Journal of Computer Vision.

[51]  Ben Taskar,et al.  Learning the Parameters of Determinantal Point Process Kernels , 2014, ICML.

[52]  Li Fei-Fei,et al.  VideoSET: Video Summary Evaluation through Text , 2014, ArXiv.

[53]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[54]  Jasper Snoek,et al.  A Determinantal Point Process Latent Variable Model for Inhibition in Neural Spiking Data , 2013, NIPS.