Video Summarization with Anchors and Multi-Head Attention

Video summarization is a challenging task: automatically generating a representative and attractive highlight movie from a source video. Previous works explicitly exploit the hierarchical structure of video to train a summarizer. However, these methods sometimes use fixed-length segmentation, which breaks the video structure, or require additional training data to train a segmentation model. In this paper, we propose an Anchor-Based Attention RNN (ABA-RNN) for solving the video summarization problem. ABA-RNN makes two contributions. First, we obtain frame-level and clip-level features through an anchor-based approach, and the model needs only one layer of RNN by introducing the subtraction mechanism used in minus-LSTM; we also use multi-head attention to let the model select suitable segment lengths. Second, we do not need any extra video preprocessing to determine shot boundaries, and our architecture is trained end-to-end. In experiments, we follow the standard SumMe and TVSum datasets and achieve performance competitive with state-of-the-art results.
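To make the multi-head attention component concrete, the sketch below shows scaled dot-product attention with multiple heads applied to a sequence of per-frame feature vectors, so that each head can attend over the sequence at its own scale. This is a minimal illustration, not the authors' actual ABA-RNN: the dimensions, the random stand-in projection matrices, and the function names are all assumptions for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Scaled dot-product multi-head attention over frame features.

    X: (T, d) array of per-frame feature vectors.
    Returns a (T, d) array of attended features.
    """
    T, d = X.shape
    assert d % num_heads == 0
    d_k = d // num_heads
    outputs = []
    for _ in range(num_heads):
        # Random projections stand in for learned weights (illustrative only).
        Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        weights = softmax(Q @ K.T / np.sqrt(d_k))  # (T, T): each frame attends to all frames
        outputs.append(weights @ V)                # per-head attended features, (T, d_k)
    # Concatenate heads back to the model dimension.
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))  # 8 frames, 16-dim features (toy sizes)
out = multi_head_attention(frames, num_heads=4, rng=rng)
print(out.shape)
```

In the full model, the learned attention weights would let the summarizer weigh frames over variable-length spans rather than relying on a fixed segmentation.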
