Video Summarization via Semantic Attended Networks

The goal of video summarization is to distill a raw video into a more compact form without losing much semantic information. However, previous methods mainly consider the diversity and representativeness of the obtained summary and seldom pay sufficient attention to the semantic information of the resulting frame set, especially long-range temporal semantics. To explicitly address this issue, we propose a novel technique that extracts the most semantically relevant video segments (i.e., those that remain relevant over a long temporal duration) and assembles them into an informative summary. To this end, we develop a semantic attended video summarization network (SASUM), which consists of a frame selector and a video descriptor, to select an appropriate number of video shots by minimizing the distance between the description sentence generated for the summarized video and the human-annotated text of the original video. Extensive experiments show that our method achieves superior performance over previous methods on two benchmark datasets.
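To make the described objective concrete, below is a minimal sketch of how a frame selector and a video descriptor could be trained against a text-matching loss. This is not the authors' implementation: the module names (FrameSelector, VideoDescriptor), the PyTorch framework, the feature dimensions, and the use of an MSE distance between sentence embeddings are all illustrative assumptions.

```python
# Minimal sketch (assumed, not from the paper): a selector scores frames,
# a descriptor embeds the weighted summary, and the loss is the distance
# to an embedding of the human-annotated description of the full video.
import torch
import torch.nn as nn

class FrameSelector(nn.Module):
    """Scores each frame feature; high scores indicate frames kept in the summary."""
    def __init__(self, feat_dim=1024, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)

    def forward(self, feats):                              # feats: (1, T, feat_dim)
        h, _ = self.rnn(feats)
        return torch.sigmoid(self.score(h)).squeeze(-1)    # (1, T) importance weights

class VideoDescriptor(nn.Module):
    """Encodes the (weighted) frame sequence into a sentence-level embedding."""
    def __init__(self, feat_dim=1024, embed_dim=512):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, embed_dim, batch_first=True)

    def forward(self, feats, weights):
        # Emphasize selected frames before encoding.
        _, (h, _) = self.rnn(feats * weights.unsqueeze(-1))
        return h[-1]                                        # (1, embed_dim)

selector, descriptor = FrameSelector(), VideoDescriptor()
feats = torch.randn(1, 120, 1024)    # e.g. CNN features of 120 frames (dummy data)
annot_embed = torch.randn(1, 512)    # embedding of the human-annotated text (dummy)

weights = selector(feats)
summary_embed = descriptor(feats, weights)
loss = nn.functional.mse_loss(summary_embed, annot_embed)  # distance to annotation
loss.backward()
```

In practice the descriptor would be a full captioning decoder and the distance would be measured between its generated sentence and the annotation; the sketch collapses that into an embedding comparison purely to show where the selector's weights enter the loss.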
