Paper Information - Enhancing Video Summarization via Vision-Language Embedding

This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story. We show that visual representations supervised by freeform language make a good fit for this application by extending a recent submodular summarization approach [9] with representativeness and interestingness objectives computed on features from a joint vision-language embedding space. We perform an evaluation on two diverse datasets, UT Egocentric [18] and TV Episodes [45], and show that our new objectives give improved summarization ability compared to standard visual features alone. Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still image vision-language datasets and transferred to video. A further benefit of our model is the ability to guide a summary using freeform text input at test time, allowing user customization.

Svetlana Lazebnik | Matthew A. Brown | Bryan A. Plummer

[1] Kristen Grauman, et al. Intentional Photos from an Unintentional Photographer: Detecting Snap Points in Egocentric Video with a Web Photo Prior, 2014, Mobile Cloud Visual Media Computing.

[2] Yong Yu, et al. Video summarization via transferrable structured learning, 2011, WWW.

[3] Francis Ferraro, et al. Visual Storytelling, 2016, NAACL.

[4] G. G. Stokes. "J.", 1890, The New Yale Book of Quotations.

[5] Li Fei-Fei, et al. VideoSET: Video Summary Evaluation through Text, 2014, ArXiv.

[6] C. Schmid, et al. Category-Specific Video Summarization, 2014, ECCV.

[7] Yale Song, et al. Video co-summarization: Video summarization by visual co-occurrence, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Yong Jae Lee, et al. Discovering important people and objects for egocentric video summarization, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9] Jiebo Luo, et al. Towards Scalable Summarization of Consumer Videos Via Sparse Dictionary Selection, 2012, IEEE Transactions on Multimedia.

[10] Takeo Kanade, et al. Video skimming and characterization through the combination of image and language understanding, 1998, Proceedings 1998 IEEE International Workshop on Content-Based Access of Image and Video Database.

[11] James M. Rehg, et al. Gaze-enabled egocentric video summarization via constrained submodular maximization, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Ba Tu Truong, et al. Video abstraction: A systematic review and classification, 2007, TOMCCAP.

[13] Lin Ma, et al. Multimodal Convolutional Neural Networks for Matching Image and Sentence, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14] Luc Van Gool, et al. Video summarization by learning submodular mixtures of objectives, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Andreas Krause, et al. Budgeted Nonparametric Learning from Data Streams, 2010, ICML.

[17] Michael A. Smith, et al. Video skimming and characterization through the combination of image and language understanding techniques, 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[18] Chih-Jen Lin, et al. Large-Scale Video Summarization Using Web-Image Priors, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[19] Yale Song, et al. TVSum: Summarizing web videos using titles, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Luc Van Gool, et al. Creating Summaries from User Videos, 2014, ECCV.

[21] M. L. Fisher, et al. An analysis of approximations for maximizing submodular set functions—I, 1978, Math. Program.

[22] Matthijs C. Dorst. Distinctive Image Features from Scale-Invariant Keypoints, 2011.

[23] Trevor Darrell, et al. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, 2013, ICML.

[24] Yin Li, et al. Learning Deep Structure-Preserving Image-Text Embeddings, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Sanja Fidler, et al. Order-Embeddings of Images and Language, 2015, ICLR.

[26] Samy Bengio, et al. Show and tell: A neural image caption generator, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27] Armand Joulin, et al. Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, 2014, NIPS.

[28] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.

[29] Lior Wolf, et al. Associating neural word embeddings with deep image representations using Fisher Vectors, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[30] Esa Rahtu, et al. Segmenting Salient Objects from Images and Videos, 2010, ECCV.

[31] Meng Wang, et al. Event Driven Web Video Summarization by Tag Localization and Key-Shot Identification, 2012, IEEE Transactions on Multimedia.

[32] Fei-Fei Li, et al. Deep visual-semantic alignments for generating image descriptions, 2015, CVPR.

[33] Wei Xu, et al. Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN), 2014, ICLR.

[34] Eric P. Xing, et al. Joint Summarization of Large-Scale Collections of Web Images and Videos for Storyline Reconstruction, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[35] Adel M. Alimi, et al. IM(S)2: Interactive movie summarization system, 2010, J. Vis. Commun. Image Represent.

[36] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, 2015, ICML.

[37] Svetlana Lazebnik, et al. Flickr30k Entities: Collecting Region-to-Phrase Correspondences for Richer Image-to-Sentence Models, 2015, International Journal of Computer Vision.

[38] Mubarak Shah, et al. Query-Focused Extractive Video Summarization, 2016, ECCV.

[39] Ke Zhang, et al. Summary Transfer: Exemplar-Based Subset Selection for Video Summarization, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[40] Yongdong Zhang, et al. Multi-task deep visual-semantic embedding for video thumbnail selection, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[41] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[42] Ali Farhadi, et al. Ranking Domain-Specific Highlights by Analyzing Edited Videos, 2014, ECCV.

[43] Lior Wolf, et al. RNN Fisher Vectors for Action Recognition and Image Annotation, 2015, ECCV.

[44] Fei-Fei Li, et al. ImageNet: A large-scale hierarchical image database, 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[45] Kristen Grauman, et al. Diverse Sequential Subset Selection for Supervised Video Summarization, 2014, NIPS.

[46] Gunhee Kim, et al. Storyline Representation of Egocentric Videos with an Applications to Story-Based Search, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[47] Kristen Grauman, et al. Story-Driven Summarization for Egocentric Video, 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[48] Bin Zhao, et al. Quasi Real-Time Summarization for Consumer Videos, 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[49] Junseok Kwon, et al. A unified framework for event summarization and rare event detection, 2012, CVPR.

[50] Peter Young, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions, 2014, TACL.

[51] Ke Zhang, et al. Video Summarization with Long Short-Term Memory, 2016, ECCV.

[52] Bernard Mérialdo, et al. Multi-video summarization based on Video-MMR, 2010, 11th International Workshop on Image Analysis for Multimedia Interactive Services WIAMIS 10.