Video Summarization With Attention-Based Encoder–Decoder Networks

This paper addresses supervised video summarization by formulating it as a sequence-to-sequence learning problem, in which the input is the sequence of original video frames and the output is a keyshot sequence. Our key idea is to learn a deep summarization network with an attention mechanism that mimics the way humans select keyshots. To this end, we propose a novel video summarization framework named attentive encoder–decoder networks for video summarization (AVS), in which the encoder uses a bidirectional long short-term memory (BiLSTM) network to encode the contextual information among the input video frames. For the decoder, two attention-based LSTM networks are explored, using additive and multiplicative objective functions, respectively. Extensive experiments are conducted on two video summarization benchmark datasets, SumMe and TVSum. The results show that the proposed AVS-based approaches outperform state-of-the-art approaches, with remarkable improvements on both datasets.
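The difference between the two decoder variants can be illustrated with a minimal sketch. It assumes that "additive" and "multiplicative" refer to the standard additive (tanh-based) and multiplicative (bilinear) attention scoring functions; the abstract does not give the formulas. All names (`additive_score`, `multiplicative_score`, `attend`), the dimensions, and the random weights are hypothetical, and the random vectors merely stand in for BiLSTM encoder outputs.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(M, v):
    """Matrix-vector product, with M given as a list of rows."""
    return [dot(row, v) for row in M]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def additive_score(s, h, W_s, W_h, v):
    """Assumed additive form: v^T tanh(W_s s + W_h h)."""
    a, b = matvec(W_s, s), matvec(W_h, h)
    return dot(v, [math.tanh(x + y) for x, y in zip(a, b)])


def multiplicative_score(s, h, W):
    """Assumed multiplicative form: s^T W h."""
    return dot(s, matvec(W, h))


def attend(s, encoder_states, score_fn):
    """Return attention weights over frames and the weighted context vector."""
    weights = softmax([score_fn(s, h) for h in encoder_states])
    dim = len(encoder_states[0])
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context


# Toy setup: 5 frames with 4-dimensional encoder states (placeholder values).
rng = random.Random(0)
DIM, FRAMES = 4, 5


def rand_vec():
    return [rng.uniform(-1, 1) for _ in range(DIM)]


def rand_mat():
    return [rand_vec() for _ in range(DIM)]


encoder_states = [rand_vec() for _ in range(FRAMES)]
decoder_state = rand_vec()
W_s, W_h, W, v = rand_mat(), rand_mat(), rand_mat(), rand_vec()

weights_add, context_add = attend(
    decoder_state, encoder_states,
    lambda s, h: additive_score(s, h, W_s, W_h, v))
weights_mul, context_mul = attend(
    decoder_state, encoder_states,
    lambda s, h: multiplicative_score(s, h, W))
```

In both variants the weights form a distribution over the input frames, and the context vector is the corresponding weighted average of encoder states that an attention-based decoder would consume at each step. Only the scoring function differs.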
