TVSum: Summarizing web videos using titles

Video summarization is a challenging problem in part because knowing which part of a video is important requires prior knowledge about its main topic. We present TVSum, an unsupervised video summarization framework that uses title-based image search results to find visually important shots. We observe that a video title is often carefully chosen to be maximally descriptive of its main topic, and hence images related to the title can serve as a proxy for important visual concepts of the main topic. However, because titles are free-form, unconstrained, and often written ambiguously, images retrieved using the title can contain noise (images irrelevant to the video content) and variance (images of different topics). To deal with this challenge, we develop a novel co-archetypal analysis technique that learns canonical visual concepts shared between the video and the images, but not those that appear in only one of the two, by finding a joint-factorial representation of the two data sets. We introduce a new benchmark dataset, TVSum50, that contains 50 videos and their shot-level importance scores annotated via crowdsourcing. Experimental results on two datasets, SumMe and TVSum50, suggest that our approach produces higher-quality summaries than several recently proposed approaches.
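
For readers unfamiliar with the building block named above, the following is a sketch of the standard (single data set) archetypal analysis objective, not the co-archetypal formulation itself. The symbols X, A, B, Z, n, p, and m are notation assumed here for illustration and do not appear in the abstract. Given a data matrix X with n columns, archetypal analysis seeks p archetypes Z = XB that are convex combinations of the data points, while each data point is in turn approximated by a convex combination of the archetypes:

```latex
\min_{A,\,B} \; \lVert X - X B A \rVert_F^2
\quad \text{s.t.} \quad
\mathbf{a}_i \in \Delta_p \;\; (i = 1, \dots, n), \qquad
\mathbf{b}_j \in \Delta_n \;\; (j = 1, \dots, p),
% X \in \mathbb{R}^{m \times n}: data matrix, one column per sample
% B \in \mathbb{R}^{n \times p}: column b_j mixes samples into archetype j, so Z = XB
% A \in \mathbb{R}^{p \times n}: column a_i mixes archetypes to reconstruct sample i
% \Delta_k = \{ v \in \mathbb{R}^k : v \ge 0, \; \textstyle\sum_l v_l = 1 \}: probability simplex
```

The co-archetypal variant described in the abstract factorizes two data sets (video frames and title-retrieved images) jointly so that the learned archetypes are shared between them; the exact coupling between the two factorizations is not stated here and is therefore not reproduced.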
