Summarizing Videos with Attention

In this work we propose a novel method for supervised, keyshot-based video summarization that applies a conceptually simple and computationally efficient soft self-attention mechanism. Current state-of-the-art methods rely on bi-directional recurrent networks, such as BiLSTM, combined with attention. These networks are complex to implement and computationally demanding compared to fully connected networks. We therefore propose a simple self-attention-based network for video summarization that performs the entire sequence-to-sequence transformation in a single feed-forward pass, with a single backward pass during training. Our method sets new state-of-the-art results on two benchmarks commonly used in this domain, TVSum and SumMe.
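To make the idea of a single feed-forward sequence-to-sequence pass concrete, the following is a minimal NumPy sketch of soft self-attention that maps per-frame features to per-frame importance scores. It is an illustration only, not the architecture described in this work: the scaled dot-product form, the sigmoid output, the function name `self_attention_scores`, the weight names, and all dimensions are assumptions chosen for brevity, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention_scores(features, w_q, w_k, w_v, w_out):
    """Map (n_frames, d_in) frame features to (n_frames,) scores in one pass.

    Hypothetical helper: names and shapes are illustrative assumptions.
    """
    q = features @ w_q                      # queries, (n_frames, d_model)
    k = features @ w_k                      # keys,    (n_frames, d_model)
    v = features @ w_v                      # values,  (n_frames, d_model)
    # Every frame attends to every other frame; no recurrence is involved.
    logits = (q @ k.T) / np.sqrt(q.shape[-1])
    attn = softmax(logits, axis=-1)         # soft attention weights per frame
    context = attn @ v                      # weighted sum of values
    # Sigmoid regression head producing one importance score per frame.
    scores = 1.0 / (1.0 + np.exp(-(context @ w_out)))
    return scores.ravel(), attn


# Toy input: 8 frames with 16-dimensional features (assumed sizes).
rng = np.random.default_rng(0)
n_frames, d_in, d_model = 8, 16, 4
features = rng.normal(size=(n_frames, d_in))
w_q = rng.normal(size=(d_in, d_model)) * 0.1
w_k = rng.normal(size=(d_in, d_model)) * 0.1
w_v = rng.normal(size=(d_in, d_model)) * 0.1
w_out = rng.normal(size=(d_model, 1)) * 0.1

scores, attn = self_attention_scores(features, w_q, w_k, w_v, w_out)
```

Because the attention matrix is computed with matrix products over the whole sequence, all frames are processed at once, in contrast to a recurrent network that steps through frames one at a time. In a keyshot-based setting, such per-frame scores would then be aggregated into shot-level scores before selecting the summary.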
