Paper information - Query-adaptive Video Summarization via Quality-aware Relevance Estimation

Although automatic video summarization has recently received a lot of attention, the problem of creating a video summary that also highlights elements relevant to a search query has been less studied. We address this problem by posing query-relevant summarization as a video frame subset selection problem, which lets us optimize for summaries that are simultaneously diverse, representative of the entire video, and relevant to a text query. We quantify relevance by measuring the distance between frames and queries in a common textual-visual semantic embedding space induced by a neural network. In addition, we extend the model to capture query-independent properties, such as frame quality. We compare our method against the previous state of the art on textual-visual embeddings for thumbnail selection and show that our model outperforms it on relevance prediction. Furthermore, we introduce a new dataset, annotated with diversity and query-specific relevance labels. On this dataset, we train and test our complete model for video summarization and show that it outperforms standard baselines such as Maximal Marginal Relevance.
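To make the baseline named in the abstract concrete, here is a minimal sketch of Maximal Marginal Relevance (MMR) applied to frame selection. It is not the paper's model: it assumes frames and the query are already embedded in a shared space, uses cosine similarity as the relevance and redundancy measure, and the names `mmr_select`, `cosine_sim`, `lam`, and the toy embeddings are all hypothetical choices made for illustration.

```python
import numpy as np


def cosine_sim(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def mmr_select(frame_emb, query_emb, k, lam=0.5):
    """Greedily pick k frame indices with Maximal Marginal Relevance.

    Each step scores a candidate frame as
        lam * sim(frame, query) - (1 - lam) * max sim(frame, already selected)
    so lam = 1 ranks purely by relevance and smaller lam favors diversity.
    """
    relevance = cosine_sim(frame_emb, query_emb[None, :])[:, 0]
    pairwise = cosine_sim(frame_emb, frame_emb)
    selected = []
    candidates = list(range(len(frame_emb)))
    while candidates and len(selected) < k:
        best, best_score = None, -np.inf
        for i in candidates:
            # Redundancy is the similarity to the closest selected frame.
            redundancy = max((pairwise[i, j] for j in selected), default=0.0)
            score = lam * relevance[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-D "embeddings" (assumed, not from the paper): frame 1 is a near
# duplicate of frame 0, frame 2 is unrelated to the query.
frames = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.7, 0.7]])
query = np.array([1.0, 0.0])

print(mmr_select(frames, query, k=2, lam=1.0))  # relevance only
print(mmr_select(frames, query, k=2, lam=0.3))  # diversity weighted higher
```

With `lam=1.0` the near-duplicate frame is chosen second because only relevance counts; lowering `lam` penalizes similarity to frames already in the summary, which is the trade-off MMR controls with a single weight.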
